gharchive.org
gharchive.org copied to clipboard
something weird with email sha1s
I noticed that some of the SHA1s of email addresses calculated for commits don't seem to be right.
For example (chosen randomly):
https://github.com/google/souper/commit/9cd0a32caba78693638692fdd0a4cf329065a627.patch
SELECT JSON_EXTRACT(payload, "$.commits[0].author") FROM `githubarchive.month.201802` WHERE id = "7311575614"
results in
{"name":"John Regehr","email":"[email protected]"}
but I would expect
$ echo -n regehr | sha1sum
6bfdb1c5e6b8340a1eb1d7a72d17faec1d09a979 -
There are other cases which are fine, but seemingly more wrong than right.
This one has me stumped. :worried:
@annafil can you think of any reason why this may happen? For example, any reason why the events API may report a different email address? Any other theories?
For reference, the hashing logic is https://github.com/igrigorik/githubarchive.org/blob/master/crawler/crawler.rb#L41-L44. I don't see any obvious gotchas there, but perhaps I'm overlooking something obvious?
- https://help.github.com/articles/about-commit-email-addresses/
- https://help.github.com/articles/setting-your-commit-email-address-on-github/
ooh this is a fun one @rspier and @igrigorik! :)
I verified that the API appears to be reporting the correct address in this event example, and that the hashing logic @igrigorik pointed to returns the sha1 that @rspier is expecting -- at least when I reproduce the logic locally with a simple example. However, when I run it on today's latest dump, I get hashes I'm not expecting to get. Suspect it might be something to do with the parsing step, will continue to investigate and update here.
OK I did a little more research. I cloned the repo, ran the crawler on my own machine with my own credentials for a little bit today, and compared the original payload with the sha1 sums being produced -- everything checks out and matches the sha1s I'm expecting.
The initial issue I saw with a smaller example was likely a one off - turns out there's a user who hashed the second half of their e-mail address and that's why it looked funny.
@rspier you mentioned you're seeing more examples that are wrong than right -- would you be able to share some of them with me to help me try to reproduce the issue?
Almost every one i look at seems wrong when I look at the BigQuery data.
here's another one:
SELECT
id,
CONCAT( "https://github.com/",repo.name,"/commit/",JSON_EXTRACT_SCALAR(payload,
"$.commits[0].sha"),".patch") AS url,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.email") AS email,
TO_HEX(SHA1("hamelsmu")) as computed
FROM
`githubarchive.month.201801`
WHERE
id = "7120532877"
AND repo.name = "igrigorik/githubarchive.org"
AND type ="PushEvent"
results in
{
"id":"7120532877",
"url":"https://github.com/igrigorik/githubarchive.org/commit/9bf9bbcff837e95953f5223c8dbe19a4ab4d5a11.patch",
"email":"[email protected]",
"computed":"3637a783b4a83928e1dd7aa9b76e39768e5fd769"
}
and here's another one
SELECT ... FROM `githubarchive.month.201711` WHERE id = "6919132679"
results in
{
"id":"6919132679",
"url":"https://github.com/igrigorik/githubarchive.org/commit/75df489390aebd1ab0b6e5e84b96b8c1a6113c3e.patch",
"email":"[email protected]",
"computed":"c9679860e330f3091c3e8d07027e1151860a1995"
}
I'm happy to provide more. These are just the first two I looked at (and found) today.
-R
thank you @rspier β¨ this is quite intriguing. I'm going to try to see how far back the issue goes over the weekend.
@igrigorik just in case, do you know offhand the versions of the dependencies you're using on the server that runs the crawler?
I need to run out now, but I quickly looked at the two pull requests from githubarchive.month.201601 (randomly chosen as "a while ago") and they (at least the first commit in each) looks ok.
(hit wrong button)
that's helpful, thanks @rspier :) (lol it's Friday)
@igrigorik just in case, do you know offhand the versions of the dependencies you're using on the server that runs the crawler?
igrigorik@worker:~/githubarchive.org/crawler$ cat Gemfile.lock
GIT
remote: git://github.com/eventmachine/eventmachine.git
revision: 4de7d0e00fe87819840ba7c7c91e7ed09b85467c
specs:
eventmachine (1.0.4)
GIT
remote: git://github.com/igrigorik/em-http-request.git
revision: 5145b7cf1fa3deb35629a115259dd260653ec7bd
specs:
em-http-request (1.1.2)
addressable (>= 2.3.4)
cookiejar (<= 0.3.0)
em-socksify (>= 0.3)
eventmachine (>= 1.0.3)
http_parser.rb (>= 0.6.0)
GEM
remote: http://rubygems.org/
specs:
addressable (2.3.6)
cookiejar (0.3.0)
em-socksify (0.3.0)
eventmachine (>= 1.0.0.beta.4)
em-stathat (0.1.0)
em-http-request (~> 1.0)
http_parser.rb (0.6.0)
log4r (1.1.10)
yajl-ruby (1.2.1)
PLATFORMS
ruby
DEPENDENCIES
em-http-request!
em-stathat
eventmachine!
log4r
yajl-ruby
igrigorik@worker:~/githubarchive.org/crawler$ ruby -v
ruby 2.1.5p273 (2014-11-13 revision 48405) [x86_64-linux]
π€
I spent some time tonight trying to get more data on when the issue started...
WITH
commits AS (
SELECT
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.email") AS email,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].sha") AS commit,
repo.name AS repo_name
FROM
`githubarchive.month.201601`
WHERE
type ="PushEvent" )
SELECT
commits.email email1,
bpdc.author.email email2,
commits.commit,
CONCAT( "https://github.com/",commits.repo_name,"/commit/",commits.commit,".patch") AS url
FROM
commits,
`bigquery-public-data.github_repos.commits` AS bpdc
WHERE
commits.commit=bpdc.commit
AND commits.email LIKE "%@gmail.com"
AND bpdc.author.email LIKE "%@gmail.com"
AND commits.email != bpdc.author.email
LIMIT
100
but bigquery-public-data.github_repos.commits has enough inconsistencies that I'm not sure how useful the analysis is. (some commits hash the entire email address including the domain, some aren't even valid email addresses.)
This might mean that my 201601 statement above might be based on luck at what I looked at, as opposed to data. There's definitely mismatches between githubarchive and bigquery-public-data for that month.
but bigquery-public-data.github_repos.commits has enough inconsistencies that I'm not sure how useful the analysis is. (some commits hash the entire email address including the domain, some aren't even valid email addresses.)
I noticed this as well, and I saw examples where it was coming out from the API directly in this form -- I suspect some users have these configured on their machines to avoid spam, like scrambling domain names.
This might mean that my 201601 statement above might be based on luck at what I looked at, as opposed to data. There's definitely mismatches between githubarchive and bigquery-public-data for that month.
@rspier very interesting.. Initially I thought this might be something very server specific (but then I'd expect the behavior to be consistent). You mentioned you're seeing examples where the hashing is as expected, and others in the same period that are incorrect, which suggests something else is going on intermittently. I tried looking for an example using your approach:
This commit from 2016-05 produces the correct hash in the archive records: 36a39c9434af71617b1400654aa53af013a1888c
This commit from the same month, however, produces an incorrect hash even though the user e-mail identifier, project contributed to etc are the same as far as I can tell: 4c5276cdc87e5421b4d87fdc52b628b06e6575fe
FWIW, my specs -- the main difference I see in configuration is my Ruby version, but the libraries that handle the parsing and the fetching seem to be more or less on par:
ruby 2.4.2p198 (2017-09-14 revision 59899) [x86_64-darwin16]
cat Gemfile.lock
GIT
remote: git://github.com/eventmachine/eventmachine.git
revision: 4de7d0e00fe87819840ba7c7c91e7ed09b85467c
specs:
eventmachine (1.0.4)
GIT
remote: git://github.com/igrigorik/em-http-request.git
revision: 5145b7cf1fa3deb35629a115259dd260653ec7bd
specs:
em-http-request (1.1.2)
addressable (>= 2.3.4)
cookiejar (<= 0.3.0)
em-socksify (>= 0.3)
eventmachine (>= 1.0.3)
http_parser.rb (>= 0.6.0)
GEM
remote: http://rubygems.org/
specs:
addressable (2.3.6)
cookiejar (0.3.0)
em-socksify (0.3.0)
eventmachine (>= 1.0.0.beta.4)
em-stathat (0.1.0)
em-http-request (~> 1.0)
http_parser.rb (0.6.0)
log4r (1.1.10)
yajl-ruby (1.2.1)
PLATFORMS
ruby
DEPENDENCIES
em-http-request!
em-stathat
eventmachine!
log4r
yajl-ruby
PS. another reason bigquery-public-data.github_repos.commits (and the API itself) might have non-email address looking e-mail fields could be if users are working with a different git client/service and mirroring their activity on GitHub - the result is messy data, but I think these values are not unexpected in this context :)
It seems very unlikely (but not totally impossible) that ruby's sha1 is producing the wrong output.
It seems very unlikely (but not totally impossible) that ruby's sha1 is producing the wrong output.
agreed, I would especially expect that if it produced the wrong output it would at least do so consistently :)
Consistently wrong would definitely be a step forward.
I can't see anything about those MurderMysteryMansion commits that looks at all suspicious.
Could it be something related to EM (eventmachine) or Proc?
I think I know what's up (credit to my teammate for spotting this). The hashes that are incorrect appear to be hashes of the correct hashes :)
Example:
echo -n 36a39c9434af71617b1400654aa53af013a1888c | openssl sha1
4c5276cdc87e5421b4d87fdc52b628b06e6575fe
So things are getting hashed twice - there must be some sort of loop that we're not accounting for. I also noticed that for the examples I pulled from May 2016, this happens for commits after May 30th 2016, but not for commits before then.
@igrigorik can you think of anything that might have happened around then?
on the bright side: once we figure out and address the underlying problem, it is conceivable to recover from this and backfill the data with the right hashes :) π
Major props to your teammate for that find.
It looks like it started happening at 2016-05-27 19:54:28.000 UTC (or around that time, filtered to @gmail.com addresses for simplicity.)
WITH
commits AS (
SELECT
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.email") AS email,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].sha") AS commit,
repo.name AS repo_name,
created_at
FROM
`githubarchive.month.201605`
WHERE
type ="PushEvent" )
SELECT
created_at,
commits.email gha_email,
bpdc.author.email bq_email,
commits.commit,
CONCAT( "https://github.com/",commits.repo_name,"/commit/",commits.commit,".patch") AS url
FROM
commits,
`bigquery-public-data.github_repos.commits` AS bpdc
WHERE
commits.commit=bpdc.commit
AND commits.email LIKE "%@gmail.com"
AND bpdc.author.email LIKE "%@gmail.com"
AND commits.email != bpdc.author.email
AND REPLACE(commits.email, "@gmail.com", "") = TO_HEX(SHA1(REPLACE(bpdc.author.email, "@gmail.com", "")))
ORDER BY
created_at ASC
LIMIT
100
Things started behaving somewhat strangely around 2016-05-26 00:03:45.000 UTC: after this point, for the rest of the day, the e-mail address written in was the same for 23,172 rows : da39a3ee5e6b4b0d3255bfef95601890afd80709 [edit because not sure that it's actually every subsequent row]
SELECT
JSON_EXTRACT_SCALAR(payload,
"$.push_id") AS push_id,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.email") AS email,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.name") AS name,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].url") as url,
min(created_at) FROM `githubarchive.day.20160526`
where JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.email")="da39a3ee5e6b4b0d3255bfef95601890afd80709"
group by 1,2,3,4
order by 5
the repeated da39a3ee5e6b4b0d3255bfef95601890afd80709 value for email stops at around 2016-05-27 19:54:31.000 UTC ^ consistent with what you're seeing @rspier, double hashing occurs thereafter π€
SELECT
JSON_EXTRACT_SCALAR(payload,
"$.push_id") AS push_id,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.email") AS email,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].author.name") AS name,
JSON_EXTRACT_SCALAR(payload,
"$.commits[0].url") as url,
min(created_at) FROM `githubarchive.day.20160527`
where created_at >= "2016-05-27 19:54:31.000 UTC"
group by 1,2,3,4
order by 5

First off, impressive detective work here β big thanks to you both. I think I know what's going on here.
- We started hashing emails on May 27th, 2016: https://github.com/igrigorik/githubarchive.org/commit/c9ae11426e5bcc30fe15617d009dfc602697ecde
- We needed to reprocess previous data to hash the emails as well and @arfon helped out with that
- He built a utility to normalize and scrub the data, which we then incorporated (June 24, 2016) into the bigquery upload script: https://github.com/igrigorik/githubarchive.org/commit/7567623bdb9ebfc73f9778dc8372e0dace79ea2f
So, now what we have is...
- The crawler grabs data from the API, hashes the emails and saves output to gzip files
- The upload script grabs the gzip file and... scrubs and hashes it again β extra clean!
<insert facepalm here>
So, the good news is: the gzip archives should not have the double hashed emails, which means we can fix the current logic and then reprocess and re-import the data from 06/2016 - now.
You're welcome. It was a fun one to figure out.
So, the good news is: the gzip archives should not have the double hashed emails, which means we can fix the current logic and then reprocess and re-import the data from 06/2016 - now.
Yay! :clap: Happy to help with this part if you need an extra pair of hands
@annafil if you're up for it, that would be a great help! I believe you should already have access to the GCP project and the GCE instance where the crawler runs and upload scripts run fromΒ β let me know if otherwise. In terms of actual work, I think we need...
- update the upload script to skip the scrub, since the input file is already doing it.
- rerun the BQ upload script for all files since may 27th, 2016... don't think there is anything crazy here, just a loop to rerun the upload job for each hourly archive.
Happy to help answer any questions along the way.
This is on my to-do list to fix after we're sure that https://github.com/igrigorik/gharchive.org/issues/178 is successful and stable, to minimize how many things we're changing in a relatively small time period :)