Data Model

How the database supports the analysis

The project uses five normalized tables to separate owners, repositories, metrics, languages, and companies. This structure makes downstream ranking, lifecycle, and cross-entity reporting easier to query.

Schema

Entity structure and analytical role

Entity relationship diagram for the project schema

Tables

The schema includes developers, repositories, repository metrics, languages, and companies.

Analytical role

The relational design separates profile information, repository metadata, and performance metrics so that queries can isolate ownership, repository activity, and language signals more clearly.

Source mismatch

The repository metadata file has much wider coverage than the separate GitHub users file. That mismatch is why the live analysis relies on the full repository source.

Methodology

How the SQL analysis was designed

Popularity ranking

Query 1 uses DENSE_RANK() over stars and forks to rank repositories by visibility across owners and languages.

Active lifespan

Query 2 uses DATEDIFF() to measure the number of days between repository creation and last push.

Impact classification

Query 3 uses CASE WHEN logic to classify repositories into impact tiers using star thresholds.

Developer productivity

Query 4 uses COUNT, SUM, and AVG with GROUP BY to compare owners by total repositories, stars, forks, and average commits.

Above-average developers

Query 5 uses two CTEs and a CROSS JOIN to identify developers whose repository counts exceed the sample mean.

Reusable SQL view

Queries 6 and 7 use a five-table join and a reusable view to precompute active_days and impact_level for downstream reporting.

SQL Snippets

Click a method above to jump directly here

Query 1: Popularity ranking

DENSE_RANK over stars and forks
SELECT
    r.repo_name,
    d.login AS owner_login,
    l.language_name,
    m.stars,
    m.forks,
    DENSE_RANK() OVER (ORDER BY m.stars DESC, m.forks DESC) AS popularity_rank
FROM repositories r
JOIN developers d ON r.developer_id = d.developer_id
JOIN languages l ON r.language_id = l.language_id
JOIN repository_metrics m ON r.repo_id = m.repo_id
ORDER BY popularity_rank, r.repo_name;

Query 2: Active lifespan

DATEDIFF between creation and last push
SELECT
    r.repo_name,
    d.login AS owner_login,
    r.created_date,
    r.pushed_date,
    DATEDIFF(r.pushed_date, r.created_date) AS active_days
FROM repositories r
JOIN developers d ON r.developer_id = d.developer_id
WHERE r.created_date IS NOT NULL
  AND r.pushed_date IS NOT NULL
ORDER BY active_days DESC;

Query 3: Impact classification

CASE WHEN star thresholds
SELECT
    r.repo_name,
    d.login AS owner_login,
    m.stars,
    m.forks,
    CASE
        WHEN m.stars >= 10 THEN 'High Popularity'
        WHEN m.stars >= 3 THEN 'Medium Popularity'
        WHEN m.stars >= 1 THEN 'Low Popularity'
        ELSE 'No Stars Yet'
    END AS popularity_category
FROM repositories r
JOIN developers d ON r.developer_id = d.developer_id
JOIN repository_metrics m ON r.repo_id = m.repo_id
ORDER BY m.stars DESC, m.forks DESC;

Query 4: Developer productivity

COUNT, SUM, AVG with GROUP BY
SELECT
    d.login AS owner_login,
    COUNT(r.repo_id) AS total_repositories,
    SUM(m.stars) AS total_stars,
    SUM(m.forks) AS total_forks,
    ROUND(AVG(m.commit_count), 2) AS avg_commits
FROM developers d
JOIN repositories r ON d.developer_id = r.developer_id
JOIN repository_metrics m ON r.repo_id = m.repo_id
GROUP BY d.developer_id, d.login
ORDER BY total_stars DESC, total_repositories DESC;

Query 5: Above-average developers

CTEs and CROSS JOIN
WITH developer_repo_counts AS (
    SELECT
        d.developer_id,
        d.login,
        COUNT(r.repo_id) AS repo_count
    FROM developers d
    LEFT JOIN repositories r ON d.developer_id = r.developer_id
    GROUP BY d.developer_id, d.login
),
average_repo_count AS (
    SELECT AVG(repo_count) AS avg_repo_count
    FROM developer_repo_counts
)
SELECT
    drc.login,
    drc.repo_count,
    ROUND(arc.avg_repo_count, 2) AS sample_average_repo_count
FROM developer_repo_counts drc
CROSS JOIN average_repo_count arc
WHERE drc.repo_count > arc.avg_repo_count
ORDER BY drc.repo_count DESC;

Queries 6 and 7: Reusable SQL view

Five-table join and analysis view
CREATE OR REPLACE VIEW repository_analysis_view AS
SELECT
    d.login AS owner_login,
    d.full_name,
    c.company_name,
    d.location,
    d.followers,
    d.public_repos,
    r.repo_name,
    r.github_url,
    l.language_name,
    r.has_readme,
    r.created_date,
    r.updated_date,
    r.pushed_date,
    DATEDIFF(r.pushed_date, r.created_date) AS active_days,
    m.stars,
    m.forks,
    m.watchers,
    m.open_issues_count,
    m.contributors_count,
    m.release_count,
    m.commit_count,
    m.file_count,
    m.community_health_percentage,
    CASE
        WHEN m.stars >= 10 THEN 'High Impact'
        WHEN m.stars >= 3 THEN 'Medium Impact'
        WHEN m.stars >= 1 THEN 'Low Impact'
        ELSE 'Early Stage'
    END AS impact_level
FROM developers d
LEFT JOIN companies c ON d.company_id = c.company_id
JOIN repositories r ON d.developer_id = r.developer_id
JOIN languages l ON r.language_id = l.language_id
JOIN repository_metrics m ON r.repo_id = m.repo_id;

Data Sources

Where the model comes from

Repository dataset

Repository name, owner, URL, stars, forks, watchers, commit count, contributors, primary language, files, README size, health score, and dates.

User dataset

Login, name, company, location, followers, following, public repo count, account creation date, and profile-level context fields.

ETL challenge

The natural join key, owner login, is sparse across the two sources, so the presentation sample and the live analytics layer use different coverage strategies.