gharchive.org

Missing events

Open lukaszgryglicki opened this issue 2 years ago • 7 comments

Hi, an issue was reported for DevStats. I did a full investigation and found that events are missing from the GHA JSON files; all details are here.

The GHA archive JSON is missing this PR-opened event: there should be a PullRequestEvent with pranav-pandey0804 as the author, but the archives only contain comments and reviews. It should be in the 2023-10-18-4 file but is not. For comparison, see my PR, which has a correct PullRequestEvent with lukaszgryglicki as the author.

Other missing events are:

  • This issue is missing its IssuesEvent (issue opened); it should have the same author, pranav-pandey0804.
  • This issue is missing 2 comments by pranav-pandey0804.
  • This issue is missing 3 comments by pranav-pandey0804.
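
A check like the ones above can be sketched as a scan over one hourly archive. This is a minimal sketch, assuming the archive was already downloaded locally as gzip-compressed, newline-delimited JSON and that each event carries `type` and `actor.login` fields; the function names `find_events` and `scan_archive` are made up for illustration.

```python
import gzip
import json
from typing import Iterable, Iterator


def find_events(lines: Iterable[str], event_type: str, actor: str) -> Iterator[dict]:
    """Yield events of the given type whose actor login matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        login = (event.get("actor") or {}).get("login")
        if event.get("type") == event_type and login == actor:
            yield event


def scan_archive(path: str, event_type: str, actor: str) -> list[dict]:
    """Scan one locally stored hourly archive (assumed gzip + NDJSON)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(find_events(handle, event_type, actor))
```

For example, `scan_archive("2023-10-18-4.json.gz", "PullRequestEvent", "pranav-pandey0804")` returning an empty list would be consistent with the missing PR-opened event; the local file name here is an assumption.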

cc @pranav-pandey0804 @igrigorik

lukaszgryglicki avatar Nov 20 '23 09:11 lukaszgryglicki

cc @caniszczyk

lukaszgryglicki avatar Nov 24 '23 06:11 lukaszgryglicki

I wonder if the CNCF or the Linux Foundation has plans to take on or sponsor @igrigorik's archiving effort, since he seems to have been inactive recently.

jiagengliu avatar Dec 28 '23 03:12 jiagengliu

I can work on this, but I need all the details about deployment(s) and permissions. cc @caniszczyk

lukaszgryglicki avatar Dec 28 '23 05:12 lukaszgryglicki

@lukaszgryglicki would appreciate any help! Please ping me via email (see profile).

igrigorik avatar Feb 02 '24 06:02 igrigorik

@igrigorik email sent.

lukaszgryglicki avatar Feb 02 '24 06:02 lukaszgryglicki

Adding another case of missed events: StarWarsAdi3. This repo should have appeared in 2024-02-29-15.json but was missed.

adityasethCSEK avatar Feb 29 '24 17:02 adityasethCSEK

This is pretty easy to explain and fix: to get complete coverage, the scraper needs to fetch all 3 pages of the events API on each execution instead of just the first page of events.

To explain, the events API can return up to 100 events per page (with ?per_page=100), with a limit of 300 total events (3 pages). All 3 pages are replaced at the same moment in time on the GitHub side as far as I can tell. Sometimes an event can only be found on page 2 or 3 and is never seen on page 1, the most obvious case being when there are more than 100 events in a given second. This happens relatively often (graph from the last 48 hours):

image

A naive (but functional) implementation could fetch all 3 pages at the same time once per second, then de-duplicate the returned events against those already seen, based on event ID. However, at 3 requests per second this comes to 10,800 requests/hour, which exceeds the limit of 5,000 requests/hour for a single API token. You can either switch to a GitHub App token (for an app installed on a paid enterprise), which has a higher rate limit of 15,000 requests/hour, or use multiple different tokens/accounts, such as one per page.
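
The naive approach described above could look roughly like the following. This is a sketch under a few assumptions, not the scraper's actual code: the helper names `fetch_page` and `collect_new_events` are hypothetical, pages are fetched one after another rather than truly at the same time, and the set of seen IDs grows without bound.

```python
import json
import urllib.request
from typing import Callable

EVENTS_URL = "https://api.github.com/events"  # public events endpoint
PAGES = 3        # the API exposes at most 3 pages
PER_PAGE = 100   # maximum page size


def fetch_page(page: int, token: str) -> list[dict]:
    """Fetch one page of public events (hypothetical helper, needs network)."""
    url = f"{EVENTS_URL}?per_page={PER_PAGE}&page={page}"
    request = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    })
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def collect_new_events(fetch: Callable[[int], list[dict]],
                       seen_ids: set[str]) -> list[dict]:
    """Read all pages and return only events whose ID was not seen before."""
    fresh = []
    for page in range(1, PAGES + 1):
        for event in fetch(page):
            if event["id"] not in seen_ids:
                seen_ids.add(event["id"])  # remember it for later polls
                fresh.append(event)
    return fresh
```

A polling loop would call `collect_new_events(lambda page: fetch_page(page, token), seen)` about once per second. To use one token per page, as suggested above, the `fetch` callable could pick a different token depending on `page`.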

bored-engineer avatar May 05 '24 19:05 bored-engineer