gharchive.org something weird with email sha1s

I noticed that some of the SHA1s of email addresses calculated for commits don't seem to be right.

For example (chosen randomly):

https://github.com/google/souper/commit/9cd0a32caba78693638692fdd0a4cf329065a627.patch

SELECT JSON_EXTRACT(payload, "$.commits[0].author") FROM  `githubarchive.month.201802` WHERE  id = "7311575614"

results in

{"name":"John Regehr","email":"[email protected]"}

but I would expect

$ echo -n regehr | sha1sum
6bfdb1c5e6b8340a1eb1d7a72d17faec1d09a979  -

There are other cases which are fine, but seemingly more wrong than right.

Mar 07 '18 19:03 rspier

This one has me stumped. :worried:

@annafil can you think of any reason why this may happen? For example, any reason why the events API may report a different email address? Any other theories?

For reference, the hashing logic is https://github.com/igrigorik/githubarchive.org/blob/master/crawler/crawler.rb#L41-L44. I don't see any obvious gotchas there, but perhaps I'm overlooking something obvious?

https://help.github.com/articles/about-commit-email-addresses/
https://help.github.com/articles/setting-your-commit-email-address-on-github/

Mar 10 '18 06:03 igrigorik

ooh this is a fun one @rspier and @igrigorik! :)

I verified that the API appears to be reporting the correct address in this event example, and that the hashing logic @igrigorik pointed to returns the sha1 that @rspier is expecting -- at least when I reproduce the logic locally with a simple example. However, when I run it on today's latest dump, I get hashes I'm not expecting to get. Suspect it might be something to do with the parsing step, will continue to investigate and update here.

Mar 13 '18 23:03 annafil

OK I did a little more research. I cloned the repo, ran the crawler on my own machine with my own credentials for a little bit today, and compared the original payload with the sha1 sums being produced -- everything checks out and matches the sha1s I'm expecting.

The initial issue I saw with a smaller example was likely a one off - turns out there's a user who hashed the second half of their e-mail address and that's why it looked funny.

@rspier you mentioned you're seeing more examples that are wrong than right -- would you be able to share some of them with me to help me try to reproduce the issue?

Mar 16 '18 22:03 annafil

Almost every one i look at seems wrong when I look at the BigQuery data.

here's another one:

SELECT
  id,
  CONCAT( "https://github.com/",repo.name,"/commit/",JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].sha"),".patch") AS url,
  JSON_EXTRACT_SCALAR(payload,
    "$.commits[0].author.email") AS email,
    TO_HEX(SHA1("hamelsmu")) as computed
FROM
  `githubarchive.month.201801`
WHERE
  id = "7120532877"
  AND repo.name = "igrigorik/githubarchive.org"
  AND type ="PushEvent"

results in

{
"id":"7120532877",
"url":"https://github.com/igrigorik/githubarchive.org/commit/9bf9bbcff837e95953f5223c8dbe19a4ab4d5a11.patch",
"email":"[email protected]",
"computed":"3637a783b4a83928e1dd7aa9b76e39768e5fd769"
}

and here's another one

SELECT ... FROM `githubarchive.month.201711` WHERE id = "6919132679"

results in

{
 "id":"6919132679",
 "url":"https://github.com/igrigorik/githubarchive.org/commit/75df489390aebd1ab0b6e5e84b96b8c1a6113c3e.patch",
 "email":"[email protected]",
 "computed":"c9679860e330f3091c3e8d07027e1151860a1995"
}

I'm happy to provide more. These are just the first two I looked at (and found) today.

-R

Mar 16 '18 23:03 rspier

thank you @rspier ✨ this is quite intriguing. I'm going to try to see how far back the issue goes over the weekend.

@igrigorik just in case, do you know offhand the versions of the dependencies you're using on the server that runs the crawler?

Mar 16 '18 23:03 annafil

I need to run out now, but I quickly looked at the two pull requests from githubarchive.month.201601 (randomly chosen as "a while ago") and they (at least the first commit in each) looks ok.

Mar 16 '18 23:03 rspier

(hit wrong button)

Mar 16 '18 23:03 rspier

that's helpful, thanks @rspier :) (lol it's Friday)

Mar 16 '18 23:03 annafil

@igrigorik just in case, do you know offhand the versions of the dependencies you're using on the server that runs the crawler?

igrigorik@worker:~/githubarchive.org/crawler$ cat Gemfile.lock 
GIT
  remote: git://github.com/eventmachine/eventmachine.git
  revision: 4de7d0e00fe87819840ba7c7c91e7ed09b85467c
  specs:
    eventmachine (1.0.4)
GIT
  remote: git://github.com/igrigorik/em-http-request.git
  revision: 5145b7cf1fa3deb35629a115259dd260653ec7bd
  specs:
    em-http-request (1.1.2)
      addressable (>= 2.3.4)
      cookiejar (<= 0.3.0)
      em-socksify (>= 0.3)
      eventmachine (>= 1.0.3)
      http_parser.rb (>= 0.6.0)
GEM
  remote: http://rubygems.org/
  specs:
    addressable (2.3.6)
    cookiejar (0.3.0)
    em-socksify (0.3.0)
      eventmachine (>= 1.0.0.beta.4)
    em-stathat (0.1.0)
      em-http-request (~> 1.0)
    http_parser.rb (0.6.0)
    log4r (1.1.10)
    yajl-ruby (1.2.1)
PLATFORMS
  ruby
DEPENDENCIES
  em-http-request!
  em-stathat
  eventmachine!
  log4r
  yajl-ruby

igrigorik@worker:~/githubarchive.org/crawler$ ruby -v
ruby 2.1.5p273 (2014-11-13 revision 48405) [x86_64-linux]

🤔

Mar 17 '18 17:03 igrigorik

I spent some time tonight trying to get more data on when the issue started...

WITH
  commits AS (
  SELECT
    JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].author.email") AS email,
    JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].sha") AS commit,
    repo.name AS repo_name
  FROM
    `githubarchive.month.201601`
  WHERE
    type ="PushEvent" )
SELECT
  commits.email email1,
  bpdc.author.email email2,
  commits.commit,
  CONCAT( "https://github.com/",commits.repo_name,"/commit/",commits.commit,".patch") AS url
FROM
  commits,
  `bigquery-public-data.github_repos.commits` AS bpdc
WHERE
  commits.commit=bpdc.commit
  AND commits.email LIKE "%@gmail.com"
  AND bpdc.author.email LIKE "%@gmail.com"
  AND commits.email != bpdc.author.email
LIMIT
  100

but bigquery-public-data.github_repos.commits has enough inconsistencies that I'm not sure how useful the analysis is. (some commits hash the entire email address including the domain, some aren't even valid email addresses.)

This might mean that my 201601 statement above might be based on luck at what I looked at, as opposed to data. There's definitely mismatches between githubarchive and bigquery-public-data for that month.

Mar 18 '18 05:03 rspier

but bigquery-public-data.github_repos.commits has enough inconsistencies that I'm not sure how useful the analysis is. (some commits hash the entire email address including the domain, some aren't even valid email addresses.)

I noticed this as well, and I saw examples where it was coming out from the API directly in this form -- I suspect some users have these configured on their machines to avoid spam, like scrambling domain names.

This might mean that my 201601 statement above might be based on luck at what I looked at, as opposed to data. There's definitely mismatches between githubarchive and bigquery-public-data for that month.

@rspier very interesting.. Initially I thought this might be something very server specific (but then I'd expect the behavior to be consistent). You mentioned you're seeing examples where the hashing is as expected, and others in the same period that are incorrect, which suggests something else is going on intermittently. I tried looking for an example using your approach:

This commit from 2016-05 produces the correct hash in the archive records: 36a39c9434af71617b1400654aa53af013a1888c

This commit from the same month, however, produces an incorrect hash even though the user e-mail identifier, project contributed to etc are the same as far as I can tell: 4c5276cdc87e5421b4d87fdc52b628b06e6575fe

FWIW, my specs -- the main difference I see in configuration is my Ruby version, but the libraries that handle the parsing and the fetching seem to be more or less on par:

ruby 2.4.2p198 (2017-09-14 revision 59899) [x86_64-darwin16]

cat Gemfile.lock
GIT
  remote: git://github.com/eventmachine/eventmachine.git
  revision: 4de7d0e00fe87819840ba7c7c91e7ed09b85467c
  specs:
    eventmachine (1.0.4)

GIT
  remote: git://github.com/igrigorik/em-http-request.git
  revision: 5145b7cf1fa3deb35629a115259dd260653ec7bd
  specs:
    em-http-request (1.1.2)
      addressable (>= 2.3.4)
      cookiejar (<= 0.3.0)
      em-socksify (>= 0.3)
      eventmachine (>= 1.0.3)
      http_parser.rb (>= 0.6.0)

GEM
  remote: http://rubygems.org/
  specs:
    addressable (2.3.6)
    cookiejar (0.3.0)
    em-socksify (0.3.0)
      eventmachine (>= 1.0.0.beta.4)
    em-stathat (0.1.0)
      em-http-request (~> 1.0)
    http_parser.rb (0.6.0)
    log4r (1.1.10)
    yajl-ruby (1.2.1)

PLATFORMS
  ruby

DEPENDENCIES
  em-http-request!
  em-stathat
  eventmachine!
  log4r
  yajl-ruby

Mar 19 '18 20:03 annafil

PS. another reason bigquery-public-data.github_repos.commits (and the API itself) might have non-email address looking e-mail fields could be if users are working with a different git client/service and mirroring their activity on GitHub - the result is messy data, but I think these values are not unexpected in this context :)

Mar 19 '18 20:03 annafil

It seems very unlikely (but not totally impossible) that ruby's sha1 is producing the wrong output.

Mar 19 '18 20:03 rspier

It seems very unlikely (but not totally impossible) that ruby's sha1 is producing the wrong output.

agreed, I would especially expect that if it produced the wrong output it would at least do so consistently :)

Mar 19 '18 20:03 annafil

Consistently wrong would definitely be a step forward.

I can't see anything about those MurderMysteryMansion commits that looks at all suspicious.

Could it be something related to EM (eventmachine) or Proc?

Mar 19 '18 20:03 rspier

I think I know what's up (credit to my teammate for spotting this). The hashes that are incorrect appear to be hashes of the correct hashes :)

Example:

echo -n 36a39c9434af71617b1400654aa53af013a1888c | openssl sha1
4c5276cdc87e5421b4d87fdc52b628b06e6575fe

So things are getting hashed twice - there must be some sort of loop that we're not accounting for. I also noticed that for the examples I pulled from May 2016, this happens for commits after May 30th 2016, but not for commits before then.

@igrigorik can you think of anything that might have happened around then?

Mar 19 '18 20:03 annafil

on the bright side: once we figure out and address the underlying problem, it is conceivable to recover from this and backfill the data with the right hashes :) 🎉

Mar 19 '18 20:03 annafil

Major props to your teammate for that find.

It looks like it started happening at 2016-05-27 19:54:28.000 UTC (or around that time, filtered to @gmail.com addresses for simplicity.)

WITH
  commits AS (
  SELECT
    JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].author.email") AS email,
    JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].sha") AS commit,
    repo.name AS repo_name,
    created_at
  FROM
    `githubarchive.month.201605`
  WHERE
    type ="PushEvent" )
SELECT
  created_at,
  commits.email gha_email,
  bpdc.author.email bq_email,
  commits.commit,
  CONCAT( "https://github.com/",commits.repo_name,"/commit/",commits.commit,".patch") AS url
FROM
  commits,
  `bigquery-public-data.github_repos.commits` AS bpdc
WHERE
  commits.commit=bpdc.commit
  AND commits.email LIKE "%@gmail.com"
  AND bpdc.author.email LIKE "%@gmail.com"
  AND commits.email != bpdc.author.email
  AND REPLACE(commits.email, "@gmail.com", "") = TO_HEX(SHA1(REPLACE(bpdc.author.email, "@gmail.com", "")))
ORDER BY
  created_at ASC
LIMIT
  100

Mar 19 '18 21:03 rspier

Things started behaving somewhat strangely around 2016-05-26 00:03:45.000 UTC: after this point, for the rest of the day, the e-mail address written in was the same for 23,172 rows : da39a3ee5e6b4b0d3255bfef95601890afd80709 [edit because not sure that it's actually every subsequent row]

SELECT 
  JSON_EXTRACT_SCALAR(payload,
      "$.push_id") AS push_id,
  JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].author.email") AS email,  
  JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].author.name") AS name,  
  JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].url") as url,
  min(created_at) FROM `githubarchive.day.20160526` 
where JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].author.email")="da39a3ee5e6b4b0d3255bfef95601890afd80709"
group by 1,2,3,4
order by 5

Mar 19 '18 22:03 annafil

the repeated da39a3ee5e6b4b0d3255bfef95601890afd80709 value for email stops at around 2016-05-27 19:54:31.000 UTC ^ consistent with what you're seeing @rspier, double hashing occurs thereafter 🤔

SELECT 
  JSON_EXTRACT_SCALAR(payload,
      "$.push_id") AS push_id,
  JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].author.email") AS email,  
  JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].author.name") AS name,  
  JSON_EXTRACT_SCALAR(payload,
      "$.commits[0].url") as url,
  min(created_at) FROM `githubarchive.day.20160527` 
where created_at >= "2016-05-27 19:54:31.000 UTC"
group by 1,2,3,4
order by 5

Mar 19 '18 22:03 annafil

First off, impressive detective work here — big thanks to you both. I think I know what's going on here.

We started hashing emails on May 27th, 2016: https://github.com/igrigorik/githubarchive.org/commit/c9ae11426e5bcc30fe15617d009dfc602697ecde
We needed to reprocess previous data to hash the emails as well and @arfon helped out with that
- He built a utility to normalize and scrub the data, which we then incorporated (June 24, 2016) into the bigquery upload script: https://github.com/igrigorik/githubarchive.org/commit/7567623bdb9ebfc73f9778dc8372e0dace79ea2f

So, now what we have is...

The crawler grabs data from the API, hashes the emails and saves output to gzip files
The upload script grabs the gzip file and... scrubs and hashes it again — extra clean!
- <insert facepalm here>

So, the good news is: the gzip archives should not have the double hashed emails, which means we can fix the current logic and then reprocess and re-import the data from 06/2016 - now.

Mar 21 '18 05:03 igrigorik

You're welcome. It was a fun one to figure out.

Mar 22 '18 03:03 rspier

So, the good news is: the gzip archives should not have the double hashed emails, which means we can fix the current logic and then reprocess and re-import the data from 06/2016 - now.

Yay! :clap: Happy to help with this part if you need an extra pair of hands

Mar 22 '18 03:03 annafil

@annafil if you're up for it, that would be a great help! I believe you should already have access to the GCP project and the GCE instance where the crawler runs and upload scripts run from — let me know if otherwise. In terms of actual work, I think we need...

update the upload script to skip the scrub, since the input file is already doing it.
rerun the BQ upload script for all files since may 27th, 2016... don't think there is anything crazy here, just a loop to rerun the upload job for each hourly archive.

Happy to help answer any questions along the way.

Mar 26 '18 06:03 igrigorik

This is on my to-do list to fix after we're sure that https://github.com/igrigorik/gharchive.org/issues/178 is successful and stable, to minimize how many things we're changing in a relatively small time period :)

Apr 06 '18 00:04 annafil

gharchive.org gharchive.org copied to clipboard

something weird with email sha1s

gharchive.org
gharchive.org copied to clipboard