github-mirror
github-mirror copied to clipboard
Wrong created_at values in commits.csv
Looks to me that the most recent dump (2019-06-01) has wrong time stamp. Below is how we may reproduce the error.
In http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-06-01.tar.gz, I found the following,
$ grep "2038-01-18" commits.csv
239581704,"1e954ef83c79ccccb8f570d350ea0ff934650fbb",4498938,4498938,18281400,"2038-01-18 23:00:00"
242299182,"0ebc3ea059dfcc8124135b94dfaa40ba7d49e5b2",1340592,1340592,18533076,"2038-01-18 23:00:00"
606676461,"d4cbfbc31940a514f2974405b579ff8989f9a0b6",33691743,33691743,55773517,"2038-01-18 00:00:00"
I loaded the CSV files to a PostgreSQL database, and then do a query. These projects are,
ghtorrent=# SELECT id,url FROM projects WHERE id=18281400 OR id=18533076 OR id=55773517;
id | url
----------+---------------------------------------------------------------
18281400 | https://api.github.com/repos/cirosantilli/test-streak-puppet2
18533076 | https://api.github.com/repos/phy1729/illacceptanything
55773517 | https://api.github.com/repos/nhm150030/jfox
(3 rows)
Having compared the dates in the CSV file with the commit log in these projects. The dates are indeed wrong. Similarly, I also found these,
$ grep "2028-12-31 23:00:00" commits.csv
239657882,"182ef308602eabf227265eb2a5034ede567dccfa",4498938,4498938,18281400,"2028-12-31 23:00:00"
242561682,"cd952389edd54aa00cd3a98382da1193a1fc6d36",1340592,1340592,18507496,"2028-12-31 23:00:00"
In fact, if we group by the duration of commits of projects where the duration is the years between the oldest and the newest commits of a project and count the number of projects in a group, I got the following,
duration | count
----------+-------
0 | 492
1 | 632
2 | 414
3 | 363
4 | 325
5 | 321
6 | 277
7 | 214
8 | 202
9 | 150
10 | 102
11 | 106
12 | 89
13 | 82
14 | 93
15 | 60
16 | 63
17 | 51
18 | 49
19 | 38
20 | 33
21 | 24
22 | 8
23 | 11
24 | 14
25 | 13
26 | 8
27 | 4
28 | 1
29 | 2
32 | 1
33 | 1
34 | 3
36 | 1
38 | 2
40 | 3
42 | 1
43 | 1
44 | 9
45 | 7
46 | 15
47 | 10
48 | 2
49 | 7
58 | 1
68 | 2
(46 rows)
where count
is the number of projects whose duration
is given on the left column.
From the FAQ page:
I've seen weird commit timestamps
Git records the commit timestamp on the developer's workstation. If the clock is misconfigured, timestamps will be weird.
We have seen timestamps such as 0000-01-01 00:00 or 2034-12-31 23:59. GitHub and GHTorrent do not process the timestamps in any way.
One of the projects you referenced has future dates set against the authored at timestamps e.g. 2046-09-18T00:00:00Z
This problem affects all data on GH itself - here's a test commit I just made in 2021.
Unfortunately, I don't think there's anything we can do about it. At least instances of this should be pretty rare.