gharchive.org icon indicating copy to clipboard operation
gharchive.org copied to clipboard

Only about 1/7 of all Linux commits recorded in GHA

Open lukaszgryglicki opened this issue 5 years ago • 9 comments

Hi, I have a question: Should all commits stored in git repository be present on the GitHub archives? I have a local clone of torvalds/linux repository. I'm counting all distinct commit's SHA's: git log --pretty=format:"%H" --since="2018-04-01" --until="2019-04-01" | uniq | wc -l. It gives 75141 commits for one year period 2018-04-01 - 2019-04-01. When analyzing all torvalds/linux commits stored in GitHub archives I can only get 10542 distinct SHAs. Maybe the problem is that Linux GitHub repository is only a mirror and most commits are not stored in GitHub API then?

lukaszgryglicki avatar Apr 09 '19 14:04 lukaszgryglicki

I've saved all single events JSONs which refer to torvalds/linux into a big JSON file (it is about 600M in size). There are many SHAs that are not present in this JSON. Google BigQuery also don't return them - but it uses githubarchive data-source, so I would expect the rootcause is here. I know that there were problems with getting all events in the past (GitHub API limits) - is it possible that only 1/7 of Linux commits are recorded in GHA due to this?

lukaszgryglicki avatar Apr 10 '19 12:04 lukaszgryglicki

Yes, I would expect some events to be missing, but no.. not 6/7th of them.

Can you share the query that you're running? Are you accounting for multiple commits in a single push?

igrigorik avatar Apr 11 '19 00:04 igrigorik

This is the query to count distinct SHAs (I'm taking all commits from push events, otherwise Linux only has about 700 pushes/year (while 75k commits for the same range - this is essential):

#standardSQL
create temp function get_shas(json string)
returns array<string>
language js as """
  try {
    return JSON.parse(json).commits.map(x=>x.sha);
  }
  catch(error) {
    return []
  }
""";
with pushes as (
  select
    get_shas(payload) as shas
  from
    `githubarchive.{{table}}`
  where
    type = 'PushEvent'
    and repo.name = 'torvalds/linux'
)
select
  count(distinct sha) as commits
from (
  select
    sha
  from
    pushes
  cross join
    unnest(pushes.shas) as sha
  where
    sha is not null
)

-- Returns 9841 for a year range {[table}} -> year.2018.

And this returns list of distinct SHAs for a given table and condition (which can be for example filter by repo.name = torvalds/linux):

#standardSQL
create temp function get_shas(json string)
returns array<string>
language js as """
  try {
    return JSON.parse(json).commits.map(x=>x.sha);
  }
  catch(error) {
    return []
  }
""";
with pushes as (
  select
    get_shas(payload) as shas
  from
    `githubarchive.{{table}}`
  where
    type = 'PushEvent'
    and {{cond}}
)
select
  distinct sha
from
  pushes
cross join
  unnest(pushes.shas) as sha
order by
  sha

But I was also grepping all JSON files (all that refer to torvalds/linux) and many many SHAs were not found in full JSON body.

lukaszgryglicki avatar Apr 11 '19 03:04 lukaszgryglicki

Hmm, this is an interesting puzzle.

I don't see any obvious issues with the query. Curious, have you tried capturing the error instead of just returning an empty array? Wondering if we're swallowing any errors there that we ought to pay attention to.

Re, grepping JSON: do you see a difference in counts between JSON archives and BigQuery?

igrigorik avatar Apr 11 '19 08:04 igrigorik

The error returning [] was for event types other than 'PushEvent' (so there was no commits field). The BQ number of commits and number of commits from DevStats (which parses all GHA JSON files) are "almost" equal (the difference is minute, less than 0.5% and that was probably due to slightly different date ranges I've checked recently). If needed I can check for the exact data later (it takes more than 1TB to process) but really DevStats returned about 10250 for a 4/2018-4/2019 and BQ returned 9840 for the year 2018 AFAIk. git returns 70-80k so the difference is huge.

lukaszgryglicki avatar Apr 11 '19 08:04 lukaszgryglicki

Gotcha, thanks for the context.

Have you tried looking at some of the missing commits: are there any common patterns or distinctions for those vs. what's in the archives? Are there time gaps, or maybe other distinctions?

igrigorik avatar Apr 13 '19 00:04 igrigorik

No I didn't make such an analysis, sorry.

lukaszgryglicki avatar Apr 13 '19 04:04 lukaszgryglicki

The push event docs state:

The array includes a maximum of 20 commits

If devs on this repo have a habit of pushing up larger amounts of commits at the same time, this might explain the discrepancy. You could do some spot checks through the commits API to compare push event payloads and actual commit counts.

chillu avatar Sep 12 '20 00:09 chillu

The precise date and time and hash of any missing commit would likely help diagnose this precisely.

xloem avatar Mar 23 '23 20:03 xloem