Open-Source Repository Research

How visible, maintained, and documented are open-source repositories?

This project studies repository visibility, lifecycle, documentation, and community health using public GitHub metadata. It combines a normalized SQL schema, cleaned analytics data, an interactive web dashboard, and a companion Power BI report.

Overview

What this project investigates

Repository visibility

The project evaluates how stars and forks are distributed across repositories and whether a small number of projects dominate attention.

Maintenance signals

It studies active lifespan, commit activity, and repository longevity to understand how maintenance patterns vary.
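As a rough illustration of the lifespan signal, the sketch below derives an active lifespan in days from two timestamps. The function name and the idea of measuring from creation to last commit are assumptions made for this example; the project's own field names and definition may differ.

```python
from datetime import datetime


def active_lifespan_days(created_at: str, last_commit_at: str) -> int:
    """Days between repository creation and the last recorded commit.

    Both arguments are ISO 8601 strings. A trailing "Z" (as used in
    GitHub API timestamps) is rewritten to "+00:00" so that
    datetime.fromisoformat accepts it on older Python versions.
    """
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    last = datetime.fromisoformat(last_commit_at.replace("Z", "+00:00"))
    # Clamp to zero so inconsistent metadata never yields a negative lifespan.
    return max(0, (last - created).days)


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    print(active_lifespan_days("2020-01-01T00:00:00Z", "2020-01-31T00:00:00Z"))
```

Clamping negative values to zero is one possible way to handle records whose last commit predates the recorded creation date; flagging such rows for review would be an equally reasonable choice.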

Documentation and quality

It compares README coverage and community health metrics to see how well repositories are documented and structured.

Research Questions

What the analysis is trying to answer

How concentrated is visibility?

Are stars and forks spread broadly across repositories, or concentrated in a small set of highly visible projects and owners?
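One way to put a number on this question is a top-share figure combined with a Gini coefficient. The sketch below is a minimal illustration using plain Python lists of star counts; the function names and sample values are hypothetical and are not taken from the project's own scripts.

```python
def top_share(counts, fraction=0.01):
    """Share of the total held by the top `fraction` of items (at least one item)."""
    if not counts:
        return 0.0
    ordered = sorted(counts, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


def gini(counts):
    """Gini coefficient: 0 means perfectly even, values near 1 mean concentrated."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Hypothetical star counts, for illustration only.
    stars = [100, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    print(top_share(stars, fraction=0.1), gini(stars))
```

With four repositories where one holds every star, `gini([0, 0, 0, 10])` evaluates to 0.75, the maximum possible for a sample of four. The same functions could be applied to fork counts, or to totals grouped by owner.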

What does repository maturity look like?

Do most repositories remain at an early stage, or do they show sustained activity and stronger signs of long-term maintenance?

How strong are documentation signals?

Does README coverage remain high even among lower-impact repositories, and how does that compare with community health measures?
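A comparison like this can be expressed as a README coverage rate per impact tier. The sketch below assumes each repository is a dictionary with `stars` and `has_readme` keys, and it uses star cutoffs of 100 and 1,000 to define tiers; both the key names and the cutoffs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def impact_tier(stars: int) -> str:
    """Assign a tier from a star count. The cutoffs are illustrative assumptions."""
    if stars >= 1000:
        return "high"
    if stars >= 100:
        return "medium"
    return "low"


def readme_coverage_by_tier(repos):
    """Return {tier: fraction of repositories in that tier that have a README}."""
    totals = defaultdict(int)
    with_readme = defaultdict(int)
    for repo in repos:
        tier = impact_tier(repo["stars"])
        totals[tier] += 1
        if repo["has_readme"]:
            with_readme[tier] += 1
    # Only tiers that actually occur are reported, so no division by zero.
    return {tier: with_readme[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = [
        {"stars": 5, "has_readme": True},
        {"stars": 12, "has_readme": False},
        {"stars": 4200, "has_readme": True},
    ]
    print(readme_coverage_by_tier(sample))
```

Reporting a rate per tier rather than a raw count keeps small and large tiers comparable, which matches the emphasis on adoption rates elsewhere in the project.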

Data

How the datasets are used

Repository metadata source

The main analysis uses the full repository metadata file, which covers 14,644 repositories and provides stars, forks, commits, README status, and lifecycle fields.

GitHub users source

A separate GitHub users dataset supports the relational schema and helps illustrate joins, normalization, and cross-source entity matching.

Key limitation

Only about 200 of the 14,644 repositories (roughly 1.4%) match across the two sources, so the public-facing analysis emphasizes repository-level patterns instead of full profile-level integration.
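The overlap between the two sources can be measured with a left join that counts how many repositories find a matching user. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and the choice to match on a case-insensitive owner login are assumptions for this example, not the project's actual schema.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a tiny in-memory database with hypothetical tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (login TEXT PRIMARY KEY);
        CREATE TABLE repositories (
            repo_id INTEGER PRIMARY KEY,
            owner_login TEXT NOT NULL
        );
        INSERT INTO users (login) VALUES ('alice'), ('bob');
        INSERT INTO repositories (repo_id, owner_login)
        VALUES (1, 'Alice'), (2, 'bob'), (3, 'carol');
        """
    )
    return conn


def match_rate(conn: sqlite3.Connection):
    """Return (matched, total, rate) for repositories whose owner is in users."""
    total, matched = conn.execute(
        """
        SELECT COUNT(*), COUNT(u.login)
        FROM repositories AS r
        LEFT JOIN users AS u
            ON LOWER(r.owner_login) = LOWER(u.login)
        """
    ).fetchone()
    # COUNT(u.login) skips NULLs, so unmatched repositories are not counted.
    return matched, total, (matched / total if total else 0.0)


if __name__ == "__main__":
    print(match_rate(build_demo()))
```

A left join keeps every repository in the result, so the unmatched rows stay visible and the match rate can be reported alongside any repository-level figure.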

Why This Project Matters

What makes the analysis stronger than a simple repository dashboard

It evaluates drivers, not just totals

The project moves beyond counting stars and asks how documentation, sustained activity, and reuse signals interact with visibility.

It uses normalized and comparative metrics

The analytical framing emphasizes averages, adoption rates, concentration, lifecycle, and impact-tier share rather than only raw aggregates.

It treats open source as a data product

The final output is positioned as a decision-support artifact for understanding repository quality, adoption, and long-term maintenance signals.

Site Map

Explore the project by page

Analysis

Interactive charts, filters, KPI cards, and repository spotlight views built from the cleaned full dataset.

Data Model

Schema design, source integration strategy, SQL view definitions, and methodology notes from the project.

Findings

Main conclusions, data-quality challenges, and future work drawn directly from the final presentation.

Deliverables

SQL scripts, prepared BI datasets, a link to the published Power BI report, and supporting documents for presentation or review.