Decline in new GitHub repositories according to BigQuery open data

Open portuiu opened this issue 7 years ago • 1 comments

Good day!

I have a question about your open data, which is available through Google BigQuery. Why has there been such a sharp drop in new GitHub repositories since November 2017? What is the reason for it? My guess is that either the methodology for gathering repositories changed, or newly created repositories are no longer being picked up.

Here is the query, which counts the new repositories for each year:

-- New in 2018 (through 2018-10-01): repositories present in t2 but absent from t1
(WITH t1 AS (
   SELECT repo_name
   FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
   WHERE DATE(author.date) < DATE("2018-01-01")
   GROUP BY repo_name),
 t2 AS (
   SELECT repo_name
   FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
   WHERE DATE(author.date) <= DATE("2018-10-01")
   GROUP BY repo_name)
 SELECT "ALL 2018", COUNT(repo_name)
 FROM (SELECT repo_name FROM t2 WHERE repo_name NOT IN (SELECT repo_name FROM t1)))

UNION ALL

-- New in 2017
(WITH t1 AS (
   SELECT repo_name
   FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
   WHERE DATE(author.date) < DATE("2017-01-01")
   GROUP BY repo_name),
 t2 AS (
   SELECT repo_name
   FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
   WHERE DATE(author.date) < DATE("2018-01-01")
   GROUP BY repo_name)
 SELECT "ALL 2017", COUNT(repo_name)
 FROM (SELECT repo_name FROM t2 WHERE repo_name NOT IN (SELECT repo_name FROM t1)))

UNION ALL

-- New in 2016
(WITH t1 AS (
   SELECT repo_name
   FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
   WHERE DATE(author.date) < DATE("2016-01-01")
   GROUP BY repo_name),
 t2 AS (
   SELECT repo_name
   FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
   WHERE DATE(author.date) < DATE("2017-01-01")
   GROUP BY repo_name)
 SELECT "ALL 2016", COUNT(repo_name)
 FROM (SELECT repo_name FROM t2 WHERE repo_name NOT IN (SELECT repo_name FROM t1)))
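For comparison, the same idea can probably be written more compactly by taking each repository's earliest author date and grouping by year. This is only a sketch under stated assumptions: it treats a repository as "new" in the year of its earliest `author.date` in this table, it drops the `UNNEST(difference)` join (so commits with an empty `difference` array are also counted, and the totals may differ slightly from the query above), and the column aliases `repo`, `first_commit`, `first_commit_year`, and `new_repos` are names made up for illustration.

```sql
-- Sketch: count repositories by the year of their earliest commit author date.
-- Assumption: "new in year Y" means the earliest author.date for the repo falls in Y (UTC).
SELECT
  EXTRACT(YEAR FROM first_commit) AS first_commit_year,
  COUNT(*) AS new_repos
FROM (
  -- One row per repository with its earliest author timestamp
  SELECT
    repo,
    MIN(author.date) AS first_commit
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(repo_name) AS repo
  GROUP BY repo
)
-- Only report repositories first seen in 2016 or later
WHERE first_commit >= TIMESTAMP("2016-01-01")
GROUP BY first_commit_year
ORDER BY first_commit_year;
```

Unlike the query above, this version does not cap 2018 at October 1, so any commits with later (or incorrectly set) author dates would show up in their own rows.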

I appreciate your help!

portuiu, Oct 19 '18 11:10

Hmm, we didn't change anything in what or how we collect. It's plausible that this reflects reality, but I don't have any other data sources to corroborate that against.

igrigorik, Oct 19 '18 18:10