gharchive.org
Only three IssuesEvent action types
The API documentation page, https://developer.github.com/v3/activity/events/types/#issuesevent, says the payload.action type for an IssuesEvent can be one of "assigned", "unassigned", "labeled", "unlabeled", "opened", "edited", "milestoned", "demilestoned", "closed", or "reopened". But when we look at the past few years of data, this value is only ever one of three: 'opened', 'closed', and 'reopened'. Are the other action types not captured by GHArchive? Thanks a lot!
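For anyone who wants to reproduce this observation, here is a minimal sketch that tallies payload.action across IssuesEvent records. It assumes the input is newline-delimited JSON, one event per line, as in the hourly GHArchive dumps. The function name `count_issue_actions` and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_issue_actions(lines):
    """Count payload.action values across IssuesEvent records.

    `lines` is any iterable of JSON strings, one event per line.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") != "IssuesEvent":
            continue  # only issue events are of interest here
        action = (event.get("payload") or {}).get("action")
        counts[action] += 1
    return counts


# Hypothetical sample records standing in for one archive file.
sample = [
    '{"type": "IssuesEvent", "payload": {"action": "opened"}}',
    '{"type": "IssuesEvent", "payload": {"action": "closed"}}',
    '{"type": "PushEvent", "payload": {}}',
    '{"type": "IssuesEvent", "payload": {"action": "opened"}}',
]
print(count_issue_actions(sample))
```

To run it over a downloaded hourly file, pass the file object from `gzip.open(path, "rt", encoding="utf-8")` as `lines`.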
Someone else flagged this to me recently too. It looks like the Events API may be surfacing only a subset of issue transitions. @annafil could you sanity check on your end?
Yes, that's correct @igrigorik.
This is a limitation on the API side, not GHArchive. While the documentation for the API says these events are surfaced, they are not available in the /events endpoint that GHArchive reads from -- only through webhooks. Will ask for this to be clarified in the docs :)
@annafil would it be possible to ask for the reverse, and add those events to the API? I've heard a few requests for this now. :)
If it helps, even though the /events stream doesn't include these types of events by default, they are currently available in a slightly different form from the API :)
Each issue event has a unique API URI, and contains the additional issue activity types above. As far as I can see historical information is still available for those events: e.g. this event from the very active rails/rails circa 2011 that also includes the 'assigned' event: https://api.github.com/repos/rails/rails/issues/411/events. It should therefore be possible to reconstruct activity for issues of particular interest if the repo and issue have not been deleted.
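As a rough illustration of the per-issue lookup described above, the sketch below builds the issue events URL and tallies the `event` field of each returned object. The helper names and the User-Agent string are made up, the fetch is unauthenticated (so it is subject to the lower anonymous rate limit), and only the first page of results is read.

```python
import json
import urllib.request
from collections import Counter

API_ROOT = "https://api.github.com"


def issue_events_url(owner, repo, number):
    """Build the per-issue events URL, e.g. for rails/rails issue 411."""
    return f"{API_ROOT}/repos/{owner}/{repo}/issues/{number}/events"


def summarize_issue_events(events):
    """Count the 'event' field (e.g. 'assigned', 'closed') across event objects."""
    return Counter(e.get("event") for e in events)


def fetch_issue_events(owner, repo, number):
    """Fetch the first page of events for one issue (makes a network call)."""
    req = urllib.request.Request(
        issue_events_url(owner, repo, number),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "issue-events-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Offline example with hand-written records shaped like the API response.
sample_events = [
    {"event": "assigned", "created_at": "2011-05-01T00:00:00Z"},
    {"event": "closed", "created_at": "2011-05-02T00:00:00Z"},
]
print(summarize_issue_events(sample_events))
```

Calling `summarize_issue_events(fetch_issue_events("rails", "rails", 411))` would run the same tally against the live API for the issue linked above.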
@igrigorik We could consider updating the crawler to fetch these related events whenever it encounters an issue, to attempt to preserve the historical data around issues better, but I defer to you on whether this is in scope for Archive :)
Thank you both for your help!
> @igrigorik We could consider updating the crawler to fetch these related events whenever it encounters an issue, to attempt to preserve the historical data around issues better, but I defer to you on whether this is in scope for Archive :)
How would you see that working? Trigger the extra fetch when an issue is "closed" to backfill? A couple of gotchas come to mind:
- Presumably issues can be updated even after they're closed, right? We would still miss data.
- Today the activity is logged into the archive when it is detected, so the fetched data would be "misaligned" with the rest, and backfilling into old gzip archives and BQ tables would add a ton of complexity.
- We're already up against the API limit. More fetches might make us lose more activity data.
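To make the trade-off concrete, here is a hypothetical sketch of the "fetch on close" idea with a request budget guard. It is not how the GHArchive crawler works. The function `plan_backfill`, the `reserve` parameter, and the sample data are all assumptions, and it relies on each IssuesEvent payload carrying an `issue` object with a `url` field.

```python
def plan_backfill(events, remaining_budget, reserve=100):
    """Pick per-issue event URLs to fetch for issues that were just closed.

    Stops planning once the extra fetches would leave `reserve` or fewer
    API requests for the main crawl.
    """
    urls = []
    seen = set()
    for event in events:
        if event.get("type") != "IssuesEvent":
            continue
        payload = event.get("payload") or {}
        if payload.get("action") != "closed":
            continue  # only trigger the extra fetch on close
        issue_url = (payload.get("issue") or {}).get("url")
        if not issue_url or issue_url in seen:
            continue  # nothing to fetch, or already planned
        if remaining_budget - len(urls) <= reserve:
            break  # keep the remaining requests for the main crawl
        seen.add(issue_url)
        urls.append(issue_url + "/events")
    return urls


# Hypothetical batch: three closed issues and one opened issue.
base = "https://api.github.com/repos/example/repo/issues/"
batch = [
    {"type": "IssuesEvent", "payload": {"action": "closed", "issue": {"url": base + "1"}}},
    {"type": "IssuesEvent", "payload": {"action": "opened", "issue": {"url": base + "2"}}},
    {"type": "IssuesEvent", "payload": {"action": "closed", "issue": {"url": base + "3"}}},
    {"type": "IssuesEvent", "payload": {"action": "closed", "issue": {"url": base + "4"}}},
]
planned = plan_backfill(batch, remaining_budget=102, reserve=100)
print(planned)
```

Even in this toy form the gotchas show up: the third closed issue is dropped once the budget is exhausted, and anything that changes on an issue after it closes is never fetched.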
Similarly, it seems that for pull request events only 'opened', 'closed', and 'reopened' are captured, and not others such as 'assigned' or 'unassigned'. Is that also expected?
@igrigorik Very good point about backfilling to the gzip archives and the added complexity. I agree with you that the API limit is a concern, and a big blocker to grabbing more of this data in some systematic way. I suspect one of the reasons these additional events are not available through the /events endpoint is because they're relatively higher in volume than open/closed/reopened events and would make it harder to keep up with the feed.
@bluecoco good question! I would expect a consistent set of events to be put out for both PRs and Issues, so this seems right to me.