gharchive.org
Decline in new GitHub repositories according to BigQuery open data
Good day!
I have a question about your open data, which is available through Google BigQuery. Why has there been such a sharp decline in new GitHub repositories since November 2017? What is the reason for it? My guess is that either the methodology for collecting repositories changed, or newly created repositories are no longer being captured.
Here is the query, which counts new repositories by year:
(WITH
  t1 AS (
    SELECT repo_name
    FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
    WHERE DATE(author.date) < DATE("2018-01-01")
    GROUP BY repo_name),
  t2 AS (
    SELECT repo_name
    FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
    WHERE DATE(author.date) <= DATE("2018-10-01")
    GROUP BY repo_name)
SELECT "ALL 2018", COUNT(repo_name)
FROM (SELECT repo_name FROM t2 WHERE repo_name NOT IN (SELECT repo_name FROM t1)))
UNION ALL
(WITH
  t1 AS (
    SELECT repo_name
    FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
    WHERE DATE(author.date) < DATE("2017-01-01")
    GROUP BY repo_name),
  t2 AS (
    SELECT repo_name
    FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
    WHERE DATE(author.date) < DATE("2018-01-01")
    GROUP BY repo_name)
SELECT "ALL 2017", COUNT(repo_name)
FROM (SELECT repo_name FROM t2 WHERE repo_name NOT IN (SELECT repo_name FROM t1)))
UNION ALL
(WITH
  t1 AS (
    SELECT repo_name
    FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
    WHERE DATE(author.date) < DATE("2016-01-01")
    GROUP BY repo_name),
  t2 AS (
    SELECT repo_name
    FROM `bigquery-public-data.github_repos.commits`, UNNEST(difference) AS diff, UNNEST(repo_name) AS repo_name
    WHERE DATE(author.date) < DATE("2017-01-01")
    GROUP BY repo_name)
SELECT "ALL 2016", COUNT(repo_name)
FROM (SELECT repo_name FROM t2 WHERE repo_name NOT IN (SELECT repo_name FROM t1)))
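For readers following the logic: each branch of the query above counts repositories that have a commit before the end of a year but none before its start, which should be the same as counting repositories whose earliest commit author date falls in that year. A minimal Python sketch of that equivalence follows. The sample rows and function names are hypothetical and only illustrate the counting. It uses full calendar years, whereas the 2018 branch of the query stops at 2018-10-01.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows of (repo_name, author_date); not real data.
commits = [
    ("alice/app", date(2015, 6, 1)),
    ("alice/app", date(2017, 3, 9)),
    ("bob/lib", date(2016, 2, 14)),
    ("carol/site", date(2017, 11, 30)),
    ("carol/site", date(2018, 1, 5)),
    ("dave/tool", date(2018, 4, 20)),
]


def repos_before(rows, cutoff):
    """Repositories with at least one commit authored before `cutoff`."""
    return {repo for repo, authored in rows if authored < cutoff}


def new_repos_set_difference(rows, year):
    """Mirror of the t2 NOT IN t1 approach for one calendar year."""
    t1 = repos_before(rows, date(year, 1, 1))
    t2 = repos_before(rows, date(year + 1, 1, 1))
    return len(t2 - t1)


def new_repos_by_first_commit(rows):
    """Count repositories by the year of their earliest author date."""
    first_seen = {}
    for repo, authored in rows:
        if repo not in first_seen or authored < first_seen[repo]:
            first_seen[repo] = authored
    return Counter(d.year for d in first_seen.values())


by_year = new_repos_by_first_commit(commits)
for year in (2016, 2017, 2018):
    print(year, new_repos_set_difference(commits, year), by_year[year])
```

Both approaches give the same per-year counts on the sample rows, so a single pass that groups by first commit date could replace the three separate branches.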
I appreciate your help!
Hmm, we didn't change anything in what or how we collect. It's plausible that this reflects reality, but I don't have any other data sources to corroborate that against.