gharchive.org

Drastic Drop Off in Events After 2025-05-23

Open cskor opened this issue 6 months ago • 9 comments

I was working with the dataset and noticed that after 2025-05-23 the size of the data and the number of events per day are drastically lower than in the previous month.

Is this drop-off genuine, an issue with GitHub, or something else?

cskor avatar Jul 16 '25 22:07 cskor

Hi @cskor

I checked the 0:00 JSON file for each day since the start of 2024 and can confirm the drop in event count.

Here's a plot of the number of events per day (between 0:00 and 0:59 AM):

Image

The most frequent events are Push events, and it's quite improbable that this drop is a natural product of user habits:

Image

Most other event types have a similar drop in frequency. Calculating the Pearson correlation (for time window 2025-03 to 2025-08) between all-event count and events-by-type counts gives:

event type                  correlation
all events                  1
Push                        0.96566
Public                      0.884655
Fork                        0.85795
Member                      0.841076
Gollum                      0.830705
Watch                       0.803054
Delete                      0.794861
Create                      0.769217
IssueComment                0.757183
Release                     0.741124
file_bytes                  0.723839
PullRequestReview           0.622
Issues                      0.583528
CommitComment               0.542668
PullRequest                 0.529304
PullRequestReviewComment    0.336829
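
For anyone who wants to reproduce this kind of table, here is a minimal sketch of the calculation in plain Python. The function names and the input layout (a dict mapping event type to a list of per-day counts) are assumptions made for illustration, not the script that produced the numbers above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlations_with_total(counts_by_type):
    """Correlate each event type's daily counts with the all-event daily total.

    counts_by_type (assumed layout): {"Push": [c_day1, c_day2, ...], ...},
    with every list covering the same days in the same order.
    """
    totals = [sum(day) for day in zip(*counts_by_type.values())]
    return {
        event_type: pearson(totals, counts)
        for event_type, counts in counts_by_type.items()
    }
```

A type that dominates the total (like Push) will correlate strongly with it almost by construction, so the lower rows of the table are the more informative ones.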

And indeed, the PullRequestReviewComment frequency is the least affected:

Image

It's quite mysterious...

For comparison, here's the plot of the file size of each 0:00 json file. The drop is not as pronounced but clearly visible:

Image

Does anybody have an idea what happened?

defgsus avatar Sep 19 '25 19:09 defgsus

Checking the event IDs reveals a general problem. Assuming that event IDs increase consecutively, a lot of events are missing from gharchive. The following table shows, for each day's 0:00 file,

  • the number of events contained in the file
  • the number of missing events, according to the event IDs
  • the number of gaps between event IDs (e.g. if the event IDs were 1000, 1001, 1003, 1004, 1010, that would be 5 events and 2 gaps)
  • and some statistics about the gap sizes
date events missing_events gaps gap_min gap_max gap_median gap_mean
2025-05-01 00:00:00 195836 884771 156829 2 364 5 6.64164
2025-05-02 00:00:00 188895 752239 148355 2 269 4 6.07054
2025-05-03 00:00:00 158714 550915 116038 2 1425 4 5.74772
2025-05-04 00:00:00 160056 339527 106159 2 687 3 4.1983
2025-05-05 00:00:00 185746 550467 135778 2 288 4 5.05418
2025-05-06 00:00:00 210130 847538 163885 2 1027 4 6.17155
2025-05-07 00:00:00 205822 921961 164479 2 509 5 6.60535
2025-05-08 00:00:00 202408 886676 161288 2 454 5 6.49748
2025-05-09 00:00:00 197907 853474 156706 2 3465 5 6.44635
2025-05-10 00:00:00 181928 545850 131607 2 2263 4 5.14758
2025-05-11 00:00:00 160884 343227 106570 2 347 3 4.22068
2025-05-12 00:00:00 186491 678571 143199 2 1987 4 5.73866
2025-05-13 00:00:00 205596 923651 164296 2 2331 5 6.62188
2025-05-14 00:00:00 199618 905315 159720 2 2614 5 6.66814
2025-05-15 00:00:00 202628 922655 161352 2 1326 5 6.71828
2025-05-16 00:00:00 195380 898317 156977 2 3378 5 6.72261
2025-05-17 00:00:00 175883 574874 130902 2 328 4 5.39164
2025-05-18 00:00:00 164459 397948 111128 2 318 3 4.581
2025-05-19 00:00:00 189520 693660 145375 2 3968 4 5.77153
2025-05-20 00:00:00 201295 908386 161475 2 409 5 6.62556
2025-05-21 00:00:00 202491 960081 163093 2 3204 5 6.88671
2025-05-22 00:00:00 201915 946826 162858 2 2661 5 6.81382
2025-05-23 00:00:00 200297 936153 161015 2 4534 5 6.81408
2025-05-24 00:00:00 136470 649038 101954 2 1955 4 7.366
2025-05-25 00:00:00 126762 409312 88292 2 1146 3 5.6359
2025-05-26 00:00:00 141729 737433 109301 2 1883 4 7.74682
2025-05-27 00:00:00 145819 812968 113030 2 2423 4 8.19251
2025-05-28 00:00:00 128147 1066013 103511 2 4717 5 11.2986
2025-05-29 00:00:00 141831 1005926 114972 2 3236 5 9.74932
2025-05-30 00:00:00 139802 1012838 112481 2 5156 5 10.0045
2025-05-31 00:00:00 136427 665128 101807 2 1737 4 7.53323
2025-06-01 00:00:00 126670 558300 93552 2 1881 4 6.96781
2025-06-02 00:00:00 142442 854724 111757 2 5245 5 8.64807
2025-06-03 00:00:00 146974 982364 118939 2 3289 5 9.2594
2025-06-04 00:00:00 145280 1090679 118793 2 6180 5 10.1813
2025-06-05 00:00:00 141977 1059661 115996 2 4009 5 10.1353
2025-06-06 00:00:00 138954 1042006 113946 2 3280 5 10.1447
2025-06-07 00:00:00 133768 703670 101729 2 1611 4 7.91711
2025-06-08 00:00:00 124182 477034 87950 2 1619 4 6.42393
2025-06-09 00:00:00 137566 736635 106225 2 1562 4 7.93468
2025-06-10 00:00:00 147513 1085730 119825 2 3249 5 10.061
2025-06-11 00:00:00 141886 1055326 115766 2 3135 5 10.116
2025-06-12 00:00:00 149995 1086791 121983 2 5647 5 9.90937
2025-06-13 00:00:00 141327 1024520 115822 2 2970 5 9.84565
2025-06-14 00:00:00 134705 707686 103309 2 2411 4 7.8502
2025-06-15 00:00:00 130544 502204 95118 2 1790 4 6.27981
2025-06-16 00:00:00 141516 810298 111282 2 2017 4 8.28149
2025-06-17 00:00:00 148966 1143279 122101 2 3219 5 10.3634
2025-06-18 00:00:00 145205 1108496 119672 2 2914 5 10.2628
2025-06-19 00:00:00 142647 1063957 116932 2 3700 5 10.0989
2025-06-20 00:00:00 144915 878110 115151 2 2588 5 8.62573
2025-06-21 00:00:00 135653 650914 103376 2 1854 4 7.29658
2025-06-22 00:00:00 127888 446221 91186 2 1668 4 5.89354
2025-06-23 00:00:00 139195 788089 109329 2 1922 4 8.20843
2025-06-24 00:00:00 145776 1153355 119367 2 3057 5 10.6623
2025-06-25 00:00:00 145915 1081156 119898 2 2150 5 10.0173
2025-06-26 00:00:00 146200 1056579 119999 2 3293 5 9.80491
2025-06-27 00:00:00 144965 1043376 117723 2 8101 5 9.86298
2025-06-28 00:00:00 133919 674600 102176 2 2250 4 7.60234
2025-06-29 00:00:00 125954 462651 89119 2 3331 4 6.1914
2025-06-30 00:00:00 139577 803805 110802 2 1990 5 8.25444
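
The columns above can be sketched roughly as follows. This is not the script that produced the table: `gap_stats` is a hypothetical helper, it assumes the event IDs have already been converted to integers, and it takes a gap's size to be the difference between two neighbouring IDs. That last choice is an inference from the table (gap_min is 2 everywhere, and gap_mean is close to missing_events / gaps + 1), not something stated in the thread.

```python
from statistics import mean, median


def gap_stats(event_ids):
    """Count events, missing IDs and gaps in a collection of integer event IDs."""
    ids = sorted(set(event_ids))
    # A gap is any pair of neighbouring IDs that are more than 1 apart.
    diffs = [b - a for a, b in zip(ids, ids[1:]) if b - a > 1]
    stats = {
        "events": len(ids),
        "missing_events": sum(d - 1 for d in diffs),
        "gaps": len(diffs),
    }
    if diffs:
        stats.update(
            gap_min=min(diffs),
            gap_max=max(diffs),
            gap_median=median(diffs),
            gap_mean=mean(diffs),
        )
    # Ratio of missing IDs to recorded events.
    stats["missing_ratio"] = stats["missing_events"] / len(ids) if ids else 0.0
    return stats
```

For the example IDs 1000, 1001, 1003, 1004, 1010 this reports 5 events, 2 gaps and 6 missing IDs (1002 and 1005 to 1009).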

Here's a plot of the ratio of missing IDs (missing events divided by events)

Image

Whatever the reason is, it seems gharchive is storing only a fraction of the actual events happening on github and this got much worse in late May.

I just checked the events API myself, and it delivers those ID gaps as expected. It's possible that all missing events are private and therefore not listed, but that does not explain the sudden change in the timeline.

defgsus avatar Sep 20 '25 09:09 defgsus

Hehe, I cannot stop... I also counted the missing IDs for every Wednesday's 0:00 file, from 2015-01-07 to 2025-09-17.

Here's the plot of the count of recorded events in each json file:

Image

And here's the plot including the number of missing event IDs and the number of gaps between event IDs:

Image

Interestingly, there seems to be another strange discontinuity in August 2017, where the number of missing IDs is reduced to less than half.

Looking at the plot of the missing ratio (the number of missing events divided by the number of recorded events), I begin to suspect that GitHub switched something in their backend in August 2017 and then switched it back in May 2025:

Image

Major real-world events like Microsoft buying GitHub in 2018 (thelinuxcode.com @ 2025-01-09), or free access to private repos in 2019 (github.blog @ 2019-01-07), do not seem to be related to the plots' discontinuities. At least at first glance.

The GitHub events API lists the SponsorshipEvent, which I have never seen in any gharchive record. That might explain some missing event IDs, but sponsorship only started in mid-2019 (github.blog @ 2019-05-23).

Here is another hint: https://github.com/igrigorik/gharchive.org/issues/171. Events were actually missed there because of API rate limits, but the fix dates from around February 2019, and I don't see that in the timelines.

defgsus avatar Sep 20 '25 12:09 defgsus

Hi guys, the situation seems to get worse this week... Any clue?

Image

huentat avatar Oct 11 '25 16:10 huentat

I believe the GitHub event stream has been broken since the outage on the 8th: the authenticated endpoint seems to be caching the 3 pages of events for a whole 10 minutes instead of the normal 1-2 seconds. The unauthenticated endpoint, ironically, still refreshes every 10 seconds. I reached out to GitHub support, but they said they hadn't heard of any other customers having this issue, so I just replied with this thread. Hopefully it's just some switch they forgot to turn back on after the event last week.

andrewortman avatar Oct 11 '25 21:10 andrewortman

By the way, the problem that caused the issue for this ticket is fairly straightforward -

The crawler.rb script only fetches the first page of events: (https://github.com/igrigorik/gharchive.org/blob/master/crawler/crawler.rb#L49)

The 3 pages of events change together. If you only poll page 1, there is a high probability that you miss events throughout the day, because they only ever show up on page 2 or page 3.

Here's the pseudocode I used to query GitHub events:

ALGORITHM PollPages
INPUT:
  PAGES ← [1, 2, 3]
  SLEEP_MS ← 250
STATE:
  etag[1..3]    // last known ETags per page
  lastKnownId   // last processed id

// One-time init
FOR p IN PAGES:
  etag[p] ← FETCH_ETAG(p)         // e.g., HEAD or initial GET

LOOP FOREVER:
  // Step 2a: probe page 1 with conditional GET
  r1 ← GET(page=1, If-None-Match=etag[1])

  IF r1.status ≠ OK THEN
    // NOT_MODIFIED or an error response: nothing new to read yet
    SLEEP(SLEEP_MS)
    CONTINUE LOOP
  ENDIF

  // Step 2b: page 1 changed → update and fetch 2 & 3 in parallel
  etag[1] ← r1.etag

  PARALLEL:
    r2 ← GET(page=2, If-None-Match=etag[2])
    r3 ← GET(page=3, If-None-Match=etag[3])
  END PARALLEL

  IF r2.status = OK THEN etag[2] ← r2.etag ENDIF
  IF r3.status = OK THEN etag[3] ← r3.etag ENDIF

  // Step 2c: merge and filter
  items ← CONCAT(ITEMS(r1), ITEMS(r2), ITEMS(r3))
  newItems ← FILTER(items, item.id > lastKnownId)
  PROCESS(newItems)                // emit or handle as needed

  // Step 2d: advance watermark
  IF newItems ≠ ∅ THEN
    lastKnownId ← MAX(item.id FOR item IN newItems)
  ENDIF

  SLEEP(SLEEP_MS)
END LOOP
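
One iteration of the loop above could look roughly like this in Python. It is only a sketch: `PageResponse`, `poll_once` and `make_static_fetch` are names invented for illustration, the HTTP layer is left as an injected `fetch(page, etag)` callable so nothing here depends on a particular client library, and pages 2 and 3 are fetched one after the other rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PageResponse:
    status: int                      # 200 = page changed, 304 = not modified
    etag: Optional[str] = None
    items: List[dict] = field(default_factory=list)


# fetch(page, known_etag) is expected to perform the conditional GET.
Fetch = Callable[[int, Optional[str]], PageResponse]


def poll_once(fetch: Fetch, etags: Dict[int, str], last_known_id: int,
              pages=(1, 2, 3)):
    """Run one iteration; returns (new_items, updated_last_known_id)."""
    first = fetch(pages[0], etags.get(pages[0]))
    if first.status != 200:
        return [], last_known_id     # 304 or an error: nothing new this round
    etags[pages[0]] = first.etag

    responses = [first]
    for page in pages[1:]:           # sequential here, parallel in the pseudocode
        resp = fetch(page, etags.get(page))
        if resp.status == 200:
            etags[page] = resp.etag
            responses.append(resp)

    fresh = {}
    for resp in responses:
        for item in resp.items:
            event_id = int(item["id"])
            if event_id > last_known_id:
                fresh[event_id] = item   # also de-duplicates across pages

    new_items = [fresh[i] for i in sorted(fresh)]
    if new_items:
        last_known_id = max(fresh)
    return new_items, last_known_id


def make_static_fetch(pages_data, etag="v1"):
    """Hypothetical stand-in for the HTTP layer, handy for dry runs."""
    def fetch(page, known_etag):
        if known_etag == etag:
            return PageResponse(status=304)
        return PageResponse(status=200, etag=etag,
                            items=list(pages_data.get(page, [])))
    return fetch
```

The caller would wrap `poll_once` in a loop with a short sleep, as in the pseudocode, and supply a real `fetch` that sends the `If-None-Match` header.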

Rate limits for authenticated accounts are 5,000 requests per hour per the docs. Right now, the cached pages don’t change fast enough for you to hit that; I only use about 4,000 requests per hour since Not Modified (304) responses don’t count toward the limit. If you want to be safe, add a simple token-bucket rate limiter to make sure you never exceed the allowed rate.
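
A token bucket of the kind mentioned above might be sketched like this. The class name, the default numbers and the injectable `clock` are assumptions for illustration. Note that the burst capacity adds to what can be sent in any one-hour window, so `rate_per_hour` plus `capacity` should stay below the real limit.

```python
import time


class TokenBucket:
    """Simple token bucket: one token per request, refilled at a steady rate."""

    def __init__(self, rate_per_hour=4500, capacity=100, clock=time.monotonic):
        self.rate = rate_per_hour / 3600.0   # tokens added per second
        self.capacity = float(capacity)      # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return False when the caller should wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token becomes available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)
```

In the polling loop you would call `try_acquire()` before each request that can count against the limit and sleep for `wait_time()` when it returns False.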

During slow periods, you might only need to pull page 1. You can adjust the pseudocode so that if the last known ID is already present on page 1, you skip querying pages 2 and 3 to save on some bandwidth for unnecessary requests.
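
That shortcut can be expressed as a small check after page 1 has been read. `pages_to_fetch` is a hypothetical helper, and it assumes the IDs have already been converted to integers.

```python
def pages_to_fetch(page1_ids, last_known_id, all_pages=(1, 2, 3)):
    """Return the pages still worth requesting once page 1 has been read."""
    if last_known_id is not None and last_known_id in set(page1_ids):
        # Page 1 already reaches back to the last processed event,
        # so nothing new can have been pushed onto pages 2 or 3.
        return ()
    return tuple(all_pages[1:])
```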

andrewortman avatar Oct 11 '25 21:10 andrewortman

I filed a separate issue in #312 for the 100x drop-off since 2025-10-09, not sure which issue to use to track this, so chiming in here as well 😃

filmaj avatar Oct 13 '25 08:10 filmaj

Indeed, @filmaj

$ du -sh 2025/*/*
1.6G	2025/01/01
1.9G	2025/01/02
2.0G	2025/01/03
1.4G	2025/01/04
1.4G	2025/01/05
2.6G	2025/01/06
2.2G	2025/01/07
2.5G	2025/01/08
2.3G	2025/01/09
2.4G	2025/01/10
1.6G	2025/01/11
1.5G	2025/01/12
2.9G	2025/01/13
2.5G	2025/01/14
2.5G	2025/01/15
2.5G	2025/01/16
2.4G	2025/01/17
1.5G	2025/01/18
1.5G	2025/01/19
2.8G	2025/01/20
2.6G	2025/01/21
2.6G	2025/01/22
2.5G	2025/01/23
2.4G	2025/01/24
1.5G	2025/01/25
1.6G	2025/01/26
3.0G	2025/01/27
2.5G	2025/01/28
2.4G	2025/01/29
2.3G	2025/01/30
2.2G	2025/01/31
1.8G	2025/02/01
1.5G	2025/02/02
2.9G	2025/02/03
2.6G	2025/02/04
2.6G	2025/02/05
2.5G	2025/02/06
2.5G	2025/02/07
1.7G	2025/02/08
1.7G	2025/02/09
3.1G	2025/02/10
2.8G	2025/02/11
2.7G	2025/02/12
2.6G	2025/02/13
2.5G	2025/02/14
1.7G	2025/02/15
1.6G	2025/02/16
3.0G	2025/02/17
2.6G	2025/02/18
2.7G	2025/02/19
2.7G	2025/02/20
2.5G	2025/02/21
1.9G	2025/02/22
1.6G	2025/02/23
3.1G	2025/02/24
2.7G	2025/02/25
2.6G	2025/02/26
2.7G	2025/02/27
2.5G	2025/02/28
2.0G	2025/03/01
1.7G	2025/03/02
3.2G	2025/03/03
2.7G	2025/03/04
2.6G	2025/03/05
2.6G	2025/03/06
2.6G	2025/03/07
1.7G	2025/03/08
1.7G	2025/03/09
3.2G	2025/03/10
2.8G	2025/03/11
2.8G	2025/03/12
2.7G	2025/03/13
2.5G	2025/03/14
1.7G	2025/03/15
1.7G	2025/03/16
3.1G	2025/03/17
2.7G	2025/03/18
2.6G	2025/03/19
2.7G	2025/03/20
2.6G	2025/03/21
1.7G	2025/03/22
1.7G	2025/03/23
3.2G	2025/03/24
2.9G	2025/03/25
2.7G	2025/03/26
2.7G	2025/03/27
2.5G	2025/03/28
1.7G	2025/03/29
1.7G	2025/03/30
3.1G	2025/03/31
3.1G	2025/04/01
2.8G	2025/04/02
2.7G	2025/04/03
2.5G	2025/04/04
1.7G	2025/04/05
1.7G	2025/04/06
3.2G	2025/04/07
2.8G	2025/04/08
2.7G	2025/04/09
2.7G	2025/04/10
2.5G	2025/04/11
1.6G	2025/04/12
1.6G	2025/04/13
3.0G	2025/04/14
2.6G	2025/04/15
2.5G	2025/04/16
2.4G	2025/04/17
2.2G	2025/04/18
1.6G	2025/04/19
1.6G	2025/04/20
2.7G	2025/04/21
2.7G	2025/04/22
2.6G	2025/04/23
2.8G	2025/04/24
2.5G	2025/04/25
1.7G	2025/04/26
1.7G	2025/04/27
3.0G	2025/04/28
2.7G	2025/04/29
2.7G	2025/04/30
2.6G	2025/05/01
2.2G	2025/05/02
1.7G	2025/05/03
1.7G	2025/05/04
3.1G	2025/05/05
2.8G	2025/05/06
2.7G	2025/05/07
2.7G	2025/05/08
2.5G	2025/05/09
1.7G	2025/05/10
1.7G	2025/05/11
3.1G	2025/05/12
2.8G	2025/05/13
2.7G	2025/05/14
2.7G	2025/05/15
2.6G	2025/05/16
1.8G	2025/05/17
1.7G	2025/05/18
3.1G	2025/05/19
2.7G	2025/05/20
2.8G	2025/05/21
2.7G	2025/05/22
2.4G	2025/05/23
1.3G	2025/05/24
1.3G	2025/05/25
1.8G	2025/05/26
1.7G	2025/05/27
1.7G	2025/05/28
1.7G	2025/05/29
1.8G	2025/05/30
1.3G	2025/05/31
1.5G	2025/06/01
2.1G	2025/06/02
1.8G	2025/06/03
1.9G	2025/06/04
1.8G	2025/06/05
1.8G	2025/06/06
1.4G	2025/06/07
1.3G	2025/06/08
2.0G	2025/06/09
2.0G	2025/06/10
1.9G	2025/06/11
1.7G	2025/06/12
1.8G	2025/06/13
1.6G	2025/06/14
1.7G	2025/06/15
2.2G	2025/06/16
1.9G	2025/06/17
1.9G	2025/06/18
1.9G	2025/06/19
1.7G	2025/06/20
1.4G	2025/06/21
1.3G	2025/06/22
2.2G	2025/06/23
2.7G	2025/06/24
2.1G	2025/06/25
1.8G	2025/06/26
1.9G	2025/06/27
1.4G	2025/06/28
1.3G	2025/06/29
2.0G	2025/06/30
2.1G	2025/07/01
1.9G	2025/07/02
1.8G	2025/07/03
1.7G	2025/07/04
1.3G	2025/07/05
1.3G	2025/07/06
2.0G	2025/07/07
1.9G	2025/07/08
1.8G	2025/07/09
1.8G	2025/07/10
1.8G	2025/07/11
1.4G	2025/07/12
1.3G	2025/07/13
2.0G	2025/07/14
2.0G	2025/07/15
1.9G	2025/07/16
2.0G	2025/07/17
2.0G	2025/07/18
1.4G	2025/07/19
1.4G	2025/07/20
2.1G	2025/07/21
2.1G	2025/07/22
1.9G	2025/07/23
1.9G	2025/07/24
1.8G	2025/07/25
1.4G	2025/07/26
1.4G	2025/07/27
2.1G	2025/07/28
2.0G	2025/07/29
1.9G	2025/07/30
1.9G	2025/07/31
2.1G	2025/08/01
1.4G	2025/08/02
1.4G	2025/08/03
2.1G	2025/08/04
2.1G	2025/08/05
2.0G	2025/08/06
2.1G	2025/08/07
2.1G	2025/08/08
1.6G	2025/08/09
1.7G	2025/08/10
2.4G	2025/08/11
2.3G	2025/08/12
2.1G	2025/08/13
2.0G	2025/08/14
1.9G	2025/08/15
1.5G	2025/08/16
1.6G	2025/08/17
2.3G	2025/08/18
2.1G	2025/08/19
2.0G	2025/08/20
2.1G	2025/08/21
2.0G	2025/08/22
1.4G	2025/08/23
1.5G	2025/08/24
2.2G	2025/08/25
2.1G	2025/08/26
2.1G	2025/08/27
2.0G	2025/08/28
1.9G	2025/08/29
1.5G	2025/08/30
1.5G	2025/08/31
2.2G	2025/09/01
2.2G	2025/09/02
2.0G	2025/09/03
2.1G	2025/09/04
1.9G	2025/09/05
1.5G	2025/09/06
1.5G	2025/09/07
1.7G	2025/09/08
2.1G	2025/09/09
2.2G	2025/09/10
1.9G	2025/09/11
1.9G	2025/09/12
1.6G	2025/09/13
1.5G	2025/09/14
2.2G	2025/09/15
2.1G	2025/09/16
2.0G	2025/09/17
1.9G	2025/09/18
1.9G	2025/09/19
1.4G	2025/09/20
1.5G	2025/09/21
2.4G	2025/09/22
2.1G	2025/09/23
2.0G	2025/09/24
1.9G	2025/09/25
1.9G	2025/09/26
1.5G	2025/09/27
1.4G	2025/09/28
2.1G	2025/09/29
2.0G	2025/09/30
2.3G	2025/10/01
2.0G	2025/10/02
1.9G	2025/10/03
1.5G	2025/10/04
1.5G	2025/10/05
2.4G	2025/10/06
2.0G	2025/10/07
1.5G	2025/10/08
5.1M	2025/10/09
4.7M	2025/10/10
3.2M	2025/10/11
3.3M	2025/10/12
2.1M	2025/10/13

ryjones avatar Oct 13 '25 10:10 ryjones

I discovered yesterday that the number of star events has started to recover, but it's still inconsistent with the official data. Has anyone encountered the same situation?

Ciannali avatar Oct 16 '25 09:10 Ciannali