gharchive.org
PushEvents missing
I am pulling data from the GH archive to extract SHA-1 commit hashes from PushEvent records. However, it seems that some PushEvent records are missing.
For example, the last two commits in this repository are:
44d03b5 feat: CATPPUCCIN
b145639 fix: stuff
The commits were made within minutes of each other. It is unclear whether the commits were pushed at the same time or in separate pushes. However, the first commit shows up in 2023-08-02-8.json.gz:
{
  "id": "30838278409",
  "type": "PushEvent",
  "actor": {
    "id": 59457929,
    "login": "PassiHD2004",
    "display_login": "PassiHD2004",
    "gravatar_id": "",
    "url": "https://api.github.com/users/PassiHD2004",
    "avatar_url": "https://avatars.githubusercontent.com/u/59457929?"
  },
  "repo": {
    "id": 616647657,
    "name": "PassiHD2004/phoenixts.eu",
    "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu"
  },
  "payload": {
    "repository_id": 616647657,
    "push_id": 14531916434,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/main",
    "head": "44d03b57a63c8e0306c8846f8fba130355360de1",
    "before": "5aa78df0c1e68abae8dce23f3746cd1f692cfb89",
    "commits": [
      {
        "sha": "44d03b57a63c8e0306c8846f8fba130355360de1",
        "author": {
          "email": "[email protected]",
          "name": "PassiHD"
        },
        "message": "feat: CATPPUCCIN\n\nSigned-off-by: PassiHD <[email protected]>",
        "distinct": true,
        "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu/commits/44d03b57a63c8e0306c8846f8fba130355360de1"
      }
    ]
  },
  "public": true,
  "created_at": "2023-08-02T08:55:40Z"
}
Neither the second commit nor any other PushEvent record for the repository appears in any archive file up to 2023-08-19, even though the commit is clearly visible on the GitHub website.
Since the missing commit was made at 10:59, could it have "fallen between the cracks" of two archives?
The missing event can be fetched from GitHub directly via https://api.github.com/repos/PassiHD2004/phoenixts.eu/events:
{
  "actor": {
    "avatar_url": "https://avatars.githubusercontent.com/u/59457929?",
    "display_login": "PassiHD2004",
    "gravatar_id": "",
    "id": 59457929,
    "login": "PassiHD2004",
    "url": "https://api.github.com/users/PassiHD2004"
  },
  "created_at": "2023-08-02T08:59:53Z",
  "id": "30838391059",
  "payload": {
    "before": "44d03b57a63c8e0306c8846f8fba130355360de1",
    "commits": [
      {
        "author": {
          "email": "[email protected]",
          "name": "PassiHD"
        },
        "distinct": true,
        "message": "fix: stuff\n\nSigned-off-by: PassiHD <[email protected]>",
        "sha": "b1456399949384acf2d38b57f50f18f8006b6006",
        "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu/commits/b1456399949384acf2d38b57f50f18f8006b6006"
      }
    ],
    "distinct_size": 1,
    "head": "b1456399949384acf2d38b57f50f18f8006b6006",
    "push_id": 14531968662,
    "ref": "refs/heads/main",
    "repository_id": 616647657,
    "size": 1
  },
  "public": true,
  "repo": {
    "id": 616647657,
    "name": "PassiHD2004/phoenixts.eu",
    "url": "https://api.github.com/repos/PassiHD2004/phoenixts.eu"
  },
  "type": "PushEvent"
}
Given that the event was created at 2023-08-02T08:59:53Z, only seconds before the hour boundary, chances are it got lost between two archives.
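To narrow down where such an event could have landed, here is a minimal Python sketch that lists the hourly archive for an event's timestamp plus its neighbours. It assumes events are bucketed by the UTC hour of `created_at`, which I have not confirmed; the file naming (hour not zero-padded) is taken from the archive names mentioned in this thread. The helper names `archive_name` and `candidate_archives` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

BASE_URL = "https://data.gharchive.org"  # download host for the hourly archives


def archive_name(ts: datetime) -> str:
    """Return the hourly archive file name for a timestamp (hour is not zero-padded)."""
    ts = ts.astimezone(timezone.utc)
    return f"{ts:%Y-%m-%d}-{ts.hour}.json.gz"


def candidate_archives(created_at: str, window_hours: int = 1) -> list[str]:
    """List archive names for the event's hour and window_hours on either side.

    Assumption: an event is stored in the archive for the UTC hour of its
    created_at value, or in a neighbouring hour if it was crawled late.
    """
    ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return [
        archive_name(ts + timedelta(hours=delta))
        for delta in range(-window_hours, window_hours + 1)
    ]


if __name__ == "__main__":
    # created_at of the PushEvent that is missing from the archives
    for name in candidate_archives("2023-08-02T08:59:53Z"):
        print(f"{BASE_URL}/{name}")
```

For the missing event this yields the archives for hours 7, 8 and 9 of 2023-08-02, so those are the files worth checking first.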
What query did you use to extract the SHA commit hashes?
I am not sure I understand the question.
Sorry, I assumed you were using a dataset with SQL-like queries. I was curious what you used to inspect all the *.json.gz files and extract the SHA commit hashes.
Ah, OK. I basically just used jq. For instance, to dump full PushEvents you can do:
curl -sSL https://data.gharchive.org/2023-08-02-8.json.gz | gunzip | jq 'select(.type == "PushEvent")'
Or, to dump all SHA commit hashes you can drill deeper:
curl -sSL https://data.gharchive.org/2023-08-02-8.json.gz | gunzip | jq -r 'select(.type == "PushEvent") | .payload.commits[].sha'
Finally, to dump all commits for a specific repository:
curl -sSL https://data.gharchive.org/2023-12-05-22.json.gz | gunzip | jq 'select(.type == "PushEvent") | select(.repo.name == "yt-dlp/yt-dlp") | .payload.commits[]'
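If you would rather stay in Python than pipe through jq, the same filtering can be sketched roughly as follows. This assumes the archive has already been downloaded and that it contains one JSON event per line, as the jq commands above imply. The function names are hypothetical, not part of any GH Archive tooling.

```python
import gzip
import io
import json
from typing import Iterable, Iterator, Optional


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def push_shas(events: Iterable[dict], repo: Optional[str] = None) -> Iterator[str]:
    """Yield commit SHAs from PushEvent records, optionally for a single repo."""
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        if repo is not None and event.get("repo", {}).get("name") != repo:
            continue
        for commit in event.get("payload", {}).get("commits", []):
            yield commit["sha"]


def shas_from_archive(source, repo: Optional[str] = None) -> list:
    """Read a gzip-compressed archive (path or binary file object) and collect SHAs."""
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        return list(push_shas(iter_events(handle), repo))


if __name__ == "__main__":
    # Assumes 2023-08-02-8.json.gz was downloaded into the current directory.
    for sha in shas_from_archive("2023-08-02-8.json.gz", repo="PassiHD2004/phoenixts.eu"):
        print(sha)
```

Passing `repo=None` corresponds to the second jq command (all SHAs); passing a repository name corresponds to the third.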
Unfortunately, we can't and don't guarantee 100% coverage. The events API is bursty, and it's possible that we occasionally miss some events. There has also been downtime on both ends. It's hard to say why this particular set of commits is missing.
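Since coverage is not guaranteed, one way to at least notice missing pushes is to follow the `before`/`head` chain in the PushEvent payloads: for a given repository and ref, each push's `before` should equal the `head` of the previous archived push. The sketch below is only a heuristic under that assumption. It can only flag a gap once a later push on the same ref has been archived, and it may report false positives. `find_push_gaps` is a hypothetical helper name.

```python
def find_push_gaps(events):
    """Return (repo, ref, expected_before, actual_before) for each broken chain link.

    A link is considered broken when a PushEvent's payload.before differs from
    the payload.head of the previous PushEvent seen for the same repo and ref.
    """
    pushes = [e for e in events if e.get("type") == "PushEvent"]
    # created_at values share one ISO 8601 UTC format, so string order is time order.
    pushes.sort(key=lambda e: e["created_at"])

    last_head = {}  # (repo name, ref) -> head SHA of the most recent push
    gaps = []
    for event in pushes:
        payload = event["payload"]
        key = (event["repo"]["name"], payload["ref"])
        previous = last_head.get(key)
        if previous is not None and payload["before"] != previous:
            gaps.append((key[0], key[1], previous, payload["before"]))
        last_head[key] = payload["head"]
    return gaps


if __name__ == "__main__":
    archived = {
        "type": "PushEvent",
        "created_at": "2023-08-02T08:55:40Z",
        "repo": {"name": "PassiHD2004/phoenixts.eu"},
        "payload": {
            "ref": "refs/heads/main",
            "before": "5aa78df0c1e68abae8dce23f3746cd1f692cfb89",
            "head": "44d03b57a63c8e0306c8846f8fba130355360de1",
        },
    }
    # Hypothetical later push, invented for illustration only.
    later = {
        "type": "PushEvent",
        "created_at": "2023-08-03T12:00:00Z",
        "repo": {"name": "PassiHD2004/phoenixts.eu"},
        "payload": {
            "ref": "refs/heads/main",
            "before": "b1456399949384acf2d38b57f50f18f8006b6006",
            "head": "0123456789abcdef0123456789abcdef01234567",
        },
    }
    for gap in find_push_gaps([archived, later]):
        print(gap)
```

Any repository and ref reported this way could then be re-fetched from the GitHub API, as was done above for the missing event.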