gharchive.org
Only 1 line for 1 hour
Hello,
Thanks for this awesome project first! 👏
I've noticed that some extracted JSON files contain only a single line for a full hour of logs. Here are the affected files from 2015, as examples:
- 2015-01-05-20
- 2015-01-06-4
- 2015-01-07-13
- 2015-01-08-2
- 2015-01-09-11
- 2015-01-09-9
- 2015-02-03-16
This seems like a bug to me.
Hmm, thanks for reporting this! Unfortunately, I can't precisely diagnose what may have caused this. The gzip archives are the source of truth. Given similar gaps around that date range (https://github.com/igrigorik/gharchive.org/issues/175), my bet is on intermittent GH API downtime.
Thanks for your quick reply. I thought about GH API downtime too, but the fact that exactly one event was recorded for the hour every time is weird.
I checked on BigQuery whether this happens often, and I found only the 7 cases reported originally. I ran the check for all years.
Sample query for 2015 (I ran the same query for 2016-2019):
SELECT
  concat(date(created_at), '-', string(HOUR(created_at))),
  COUNT(1)
FROM
  [githubarchive:year.2015]
GROUP BY 1
HAVING COUNT(1) = 1
ORDER BY 1 ASC
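
Since the gzip archives are the source of truth, the same check could also be run against the hourly files directly. Below is a minimal sketch in Python, assuming the archives have already been downloaded to a local directory and are named like `2015-01-05-20.json.gz` with one JSON event per line. The helper names `count_events` and `single_event_hours` are hypothetical, not part of the project.

```python
import gzip
from pathlib import Path


def count_events(path):
    """Count non-empty lines (assumed: one JSON event per line) in a gzipped hourly archive."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def single_event_hours(directory):
    """Return the sorted names of archives in `directory` that contain exactly one event."""
    return sorted(
        p.name for p in Path(directory).glob("*.json.gz") if count_events(p) == 1
    )
```

For example, `single_event_hours("archives/")` (a hypothetical local path) would list the hourly files worth comparing against the BigQuery results.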