gharchive.org Missing events

Hi, there was an issue reported for DevStats. I did a full investigation and found that events are missing from the GHA JSON files; all details are here.

The GHA archive JSON is missing this PR's opened event: there should be a PullRequestEvent with pranav-pandey0804 as the author, but the archives only contain comments and reviews. It should be in the 2023-10-18-4 file but is not. Compare with, for example, my PR, which has a correct PullRequestEvent with lukaszgryglicki as the author.

Other missing events are:

This issue is missing the IssuesEvent for the issue being opened; it should be for the same author, pranav-pandey0804.
This issue is missing 2 comments from the author pranav-pandey0804.
This issue is missing 3 comments from the author pranav-pandey0804.
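
For anyone who wants to reproduce this kind of check, here is a minimal sketch of how an hourly archive can be searched for a given event type and author. It assumes the archive is gzip-compressed, newline-delimited JSON where each event carries a `type` field and an `actor.login` field; the helper names `find_events` and `open_archive` are made up for illustration and are not part of any GH Archive tooling.

```python
import gzip
import io
import json
from typing import Iterable, Iterator


def find_events(lines: Iterable[str], actor: str, event_type: str) -> Iterator[dict]:
    """Yield events of the given type whose actor login matches `actor`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        # Assumption: each record looks like {"type": ..., "actor": {"login": ...}, ...}
        if event.get("type") == event_type and event.get("actor", {}).get("login") == actor:
            yield event


def open_archive(raw: bytes) -> Iterator[str]:
    """Decode the bytes of a gzip-compressed hourly archive into text lines."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as fh:
        yield from fh
```

With the bytes of an already downloaded hourly file (for example the 2023-10-18-4 one) in `raw`, `list(find_events(open_archive(raw), "pranav-pandey0804", "PullRequestEvent"))` coming back empty would match what is described above.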

cc @pranav-pandey0804 @igrigorik

Nov 20 '23 09:11 lukaszgryglicki

cc @caniszczyk

Nov 24 '23 06:11 lukaszgryglicki

I wonder if CNCF or the Linux Foundation has plans to take on or sponsor @igrigorik's archiving effort, since he seems to have been inactive recently.

Dec 28 '23 03:12 jiagengliu

I can work on this, but I need all the details about deployment(s) and permissions. cc @caniszczyk

Dec 28 '23 05:12 lukaszgryglicki

@lukaszgryglicki would appreciate any help! Please ping me via email (see profile).

Feb 02 '24 06:02 igrigorik

@igrigorik email sent.

Feb 02 '24 06:02 lukaszgryglicki

Adding another case of missed events: StarWarsAdi3. This repo should have appeared in 2024-02-29-15.json but was missed.

Feb 29 '24 17:02 adityasethCSEK

This is pretty easy to explain and fix: you need to scrape all 3 pages of the events API on each execution of the scraper, instead of just the first page, to obtain complete coverage.

To explain, the events API can return up to 100 events per page (when ?per_page=100), with a limit of 300 total events (3 pages). As far as I can tell, all 3 pages are replaced at the same moment on the GitHub side. Sometimes an event can only be found on page 2 or 3 and is never seen on page 1; the most obvious case is when there are more than 100 events in a given second. This happens relatively often (graph from the last 48 hours).

A naive (but functional) implementation could fetch all 3 pages at the same time once per second, then de-duplicate the returned events against those already seen, based on event ID. However, this requires 3 × 3,600 = 10,800 requests/hour, which exceeds the 5,000 requests/hour limit for a single API token. You can either switch to a GitHub app token (installed on a paid enterprise), which has a higher rate limit of 15,000, or use multiple different tokens/accounts, such as one per page.
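
A rough sketch of that naive approach, to make the de-duplication and the rate-limit arithmetic concrete. The `fetch_page` callable is a hypothetical stand-in for whatever HTTP client the scraper uses (it takes a page number and returns that page's events as a list of dicts), and `collect_new_events` and `requests_per_hour` are illustrative names, not anything from the gharchive.org code.

```python
from typing import Callable, Dict, List, Set

PAGES = 3       # as described above: at most 3 pages are exposed
PER_PAGE = 100  # and up to 100 events per page with ?per_page=100


def collect_new_events(
    fetch_page: Callable[[int], List[Dict]],  # hypothetical: page number -> events
    seen_ids: Set[str],
) -> List[Dict]:
    """Fetch every page once and return only events whose ID is not yet in seen_ids."""
    fresh = []
    for page in range(1, PAGES + 1):
        for event in fetch_page(page):
            event_id = event["id"]
            if event_id in seen_ids:
                continue  # already recorded on an earlier poll or an earlier page
            seen_ids.add(event_id)
            fresh.append(event)
    return fresh


def requests_per_hour(pages: int = PAGES, polls_per_second: float = 1.0) -> int:
    """Number of API requests per hour needed to poll `pages` pages at the given rate."""
    return int(pages * polls_per_second * 3600)
```

Calling `collect_new_events` once per second with a shared `seen_ids` set gives the behaviour described above, and `requests_per_hour()` works out to 10,800, which is over the single-token limit but under the 15,000 mentioned for an app token. In a long-running scraper the set would need to be pruned periodically, since it otherwise grows without bound.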

May 05 '24 19:05 bored-engineer
