Only 1 line for 1 hour

Open AntoineAugusti opened this issue 6 years ago • 2 comments

Hello,

First of all, thanks for this awesome project! 👏

I've noticed that some extracted JSON files contain only a single line for a full hour of logs. Here are the affected files for 2015, for example:

  • 2015-01-05-20
  • 2015-01-06-4
  • 2015-01-07-13
  • 2015-01-08-2
  • 2015-01-09-11
  • 2015-01-09-9
  • 2015-02-03-16

This seems like a bug to me
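
A minimal sketch of how the line count of one hourly archive could be checked locally. It assumes the file has already been downloaded (for example `2015-01-05-20.json.gz`) and that each non-empty line is one JSON event; the helper name `count_events` is hypothetical and not part of the project.

```python
import gzip


def count_events(path):
    """Count non-empty lines in a gzipped hourly archive.

    Assumption: the archive holds one JSON event per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())
```

Calling `count_events("2015-01-05-20.json.gz")` on a local copy would be expected to return 1 if the file really holds a single event.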

AntoineAugusti avatar Mar 28 '19 16:03 AntoineAugusti

Hmm, thanks for reporting this! Unfortunately, I can't precisely diagnose what may have caused this. The gzip archives are the source of truth. Given similar gaps around that date range (https://github.com/igrigorik/gharchive.org/issues/175), my bet is on intermittent GH API downtime.

igrigorik avatar Mar 29 '19 03:03 igrigorik

Thanks for your quick reply. I thought about GH API downtime, but the fact that exactly one event was recorded during the hour every time is weird.

I checked in BigQuery how often this happened and found only the 7 cases reported originally. I ran the check for all years.

Sample query for 2015 (I also ran it for 2016-2019):

SELECT
  concat(date(created_at), '-', string(HOUR(created_at))),
  COUNT(1)
FROM
  [githubarchive:year.2015]
GROUP BY 1
HAVING count(1) = 1
ORDER BY 1 ASC
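
For anyone without BigQuery access, the same grouping can be sketched locally. This is only an illustration under assumptions: the function name `single_event_hours` is hypothetical, and the timestamps are assumed to be `created_at` strings in the form `2015-01-05T20:15:00Z`.

```python
from collections import Counter
from datetime import datetime


def single_event_hours(timestamps):
    """Return the hour buckets (YYYY-MM-DD-H) that contain exactly one event.

    Mirrors the GROUP BY / HAVING count = 1 logic of the query above.
    Assumption: timestamps look like "2015-01-05T20:15:00Z".
    """
    counts = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        # The hour is not zero-padded, matching names such as 2015-01-06-4.
        counts[f"{dt:%Y-%m-%d}-{dt.hour}"] += 1
    return sorted(bucket for bucket, n in counts.items() if n == 1)
```

Given the `created_at` values of a set of events, the function returns the sorted list of hours that saw only one event.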

AntoineAugusti avatar Mar 29 '19 13:03 AntoineAugusti