gharchive.org
Missing events
Hi, an issue was reported for DevStats. I did a full investigation and found that events are missing from the GHA JSON files; all details are here.
The GHA archive JSON is missing this PR's opened event: there should be a PullRequestEvent with pranav-pandey0804 as the author, but the archives only contain comments and reviews. It should be in the 2023-10-18-4 file but is not. Compare, for example, my PR, which has a correct PullRequestEvent with lukaszgryglicki as the author.
Other missing events are:
- This issue is missing the IssuesEvent issue-opened event - it should be for the same author, pranav-pandey0804.
- This issue is missing 2 comments from the author pranav-pandey0804.
- This issue is missing 3 comments from the author pranav-pandey0804.
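For anyone who wants to reproduce this kind of check, here is a minimal sketch of scanning one hourly archive file for events of a given type and author. It assumes the usual GH Archive layout (a gzipped file with one JSON event per line, each carrying `type` and `actor.login`); the function names `find_events` and `scan_archive` are made up for illustration and are not part of any GH Archive or DevStats tooling.

```python
from __future__ import annotations

import gzip
import json
from typing import Iterable, Iterator


def find_events(lines: Iterable[str], event_type: str, actor: str) -> Iterator[dict]:
    """Yield events of the given type whose actor login matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        same_type = event.get("type") == event_type
        same_actor = event.get("actor", {}).get("login") == actor
        if same_type and same_actor:
            yield event


def scan_archive(path: str, event_type: str, actor: str) -> list[dict]:
    """Scan a gzipped, newline-delimited JSON archive file (assumed layout)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return list(find_events(fh, event_type, actor))
```

With a locally downloaded hourly file, something like `scan_archive("2023-10-18-4.json.gz", "PullRequestEvent", "pranav-pandey0804")` returning an empty list would be consistent with the missing event described above (the local file name here is an assumption).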
cc @pranav-pandey0804 @igrigorik
cc @caniszczyk
I wonder if CNCF or the Linux Foundation has plans to take on or sponsor @igrigorik's archiving effort, since he seems to have been inactive recently?
I can work on this, but I need all the details about deployment(s) and permissions. cc @caniszczyk
@lukaszgryglicki would appreciate any help! Please ping me via email (see profile).
@igrigorik email sent.
Adding another case of missed events: StarWarsAdi3. This repo should have appeared in 2024-02-29-15.json but was missed.
This is pretty easy to explain and fix: to obtain complete coverage, the scraper needs to fetch all 3 pages of the events API on each execution instead of just the first page.
To explain, the events API can return up to 100 events per page (with ?per_page=100), with a limit of 300 total events (3 pages). As far as I can tell, all 3 pages are replaced at the same moment on the GitHub side. Sometimes an event can only be found on page 2 or 3 and is never seen on page 1, the most obvious case being when there are more than 100 events in a given second. This happens relatively often. [Graph: last 48 hours]
A naive (but functional) implementation could fetch all 3 pages at the same time once per second, then de-duplicate the returned events against those already seen, based on event ID. However, 3 pages per second is 10,800 requests/hour, which exceeds the 5,000 requests/hour limit for a single API token. You can either switch to a GitHub App token (installed on a paid enterprise), which has a higher rate limit of 15,000 requests/hour, or use multiple tokens/accounts, such as one per page.
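A rough sketch of that naive approach, assuming the public events endpoint at `https://api.github.com/events` with `per_page` and `page` query parameters. The helper names `make_fetcher` and `poll_once` are hypothetical, this is not the actual GH Archive scraper, and retries, error handling, and rate-limit handling are left out.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed endpoint: the public events API, 100 events per page.
EVENTS_URL = "https://api.github.com/events?per_page=100&page={page}"


def make_fetcher(token):
    """Return a fetch_page(page) function bound to one API token (hypothetical helper)."""
    def fetch_page(page):
        req = urllib.request.Request(
            EVENTS_URL.format(page=page),
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/vnd.github+json",
            },
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)
    return fetch_page


def poll_once(fetch_page, seen, pages=(1, 2, 3)):
    """Fetch all pages concurrently and return only events with unseen IDs."""
    with ThreadPoolExecutor(max_workers=len(pages)) as pool:
        batches = list(pool.map(fetch_page, pages))  # results keep page order
    fresh = []
    for batch in batches:
        for event in batch:
            if event["id"] not in seen:
                seen.add(event["id"])
                fresh.append(event)
    return fresh
```

A caller would create `seen = set()` once, call `poll_once(make_fetcher(token), seen)` roughly once per second, and write the returned events to the current hourly file. In a long-running process the `seen` set would need to be pruned (for example, dropping IDs older than the 300-event window), which this sketch does not do. To spread the load across tokens, `fetch_page` could pick a different token per page.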