duplicated fork event

Open bluecoco opened this issue 7 years ago • 8 comments

While exploring the 2017 GitHub Archive data, we found what seem to be duplicated events. For example, there can be two fork events with the same information except for the created_at date and the event id (same actor, same repo, and same forkee id). Is this caused by the API capturing the same event multiple times? If so, is it safe to assume that we should use the latest record? Thanks!

bluecoco avatar Apr 04 '18 21:04 bluecoco

Interesting. We don't modify the created_at or any of the payload data, so the fact that multiple events are coming up means that the GH API provided multiple events. Do you have some particular examples you can share? How far apart are these events?

In theory, couldn't the user fork once, delete the repo (I don't think there is an event for this) and fork again? E.g. if I fork into wrong org, or some such?

@annafil any other ideas? :)

igrigorik avatar Apr 06 '18 00:04 igrigorik

@igrigorik those are all great questions, and I think you covered my initial guesses on this as well :) If the event_id is different, it is likely that multiple distinct events are being sent by the API.

@bluecoco if you could share some examples, I can try to track down if this is the same event re-occurring, or actually a user forking multiple times, as @igrigorik suggested.

annafil avatar Apr 06 '18 00:04 annafil

When I was investigating the issue, I ran a fork, unfork, and re-fork test, and the events API showed a different forkee.id for each fork. But in the cases we found in the data, the forkee ids are the same.

The event times are very close, about half an hour apart.

One pair of examples here (may not contain all the fields):

{ "payload": { "forkee": { "updated_at": "2017-08-30T13:02:10Z", "private": false, "has_wiki": true, "full_name": "Leitnin/Repetier-Firmware", "owner": { "avatar_url": "https://avatars1.githubusercontent.com/u/9035220?v=4", "site_admin": false, "login": "Leitnin", "type": "User", "id": 9035220 }, "id": 101881905, "description": "Firmware for Arduino based RepRap 3D printer.", "has_pages": false, "open_issues_count": 0, "has_projects": true, "watchers_count": 0, "size": 16466, "public": true, "has_issues": false, "has_downloads": true, "name": "Repetier-Firmware", "language": "HTML", "created_at": "2017-08-30T13:02:06Z", "pushed_at": "2017-08-25T14:46:54Z", "extension": {}, "forks_count": 0, "default_branch": "master" } }, "created_at": "2017-08-30T13:33:43Z", "actor": { "avatar_url": "https://avatars.githubusercontent.com/u/9035220?", "login": "Leitnin", "id": 9035220 }, "id": "6529932288", "repo": { "id": 2323906, "name": "repetier/Repetier-Firmware" }, "type": "ForkEvent", "public": true }


{ "payload": { "forkee": { "updated_at": "2017-08-29T21:39:52Z", "private": false, "has_wiki": true, "full_name": "Leitnin/Repetier-Firmware", "owner": { "avatar_url": "https://avatars1.githubusercontent.com/u/9035220?v=4", "site_admin": false, "login": "Leitnin", "type": "User", "id": 9035220 }, "id": 101881905, "description": "Firmware for Arduino based RepRap 3D printer.", "has_pages": false, "open_issues_count": 0, "has_projects": true, "watchers_count": 0, "size": 16466, "public": true, "has_issues": false, "has_downloads": true, "name": "Repetier-Firmware", "language": null, "created_at": "2017-08-30T13:02:06Z", "pushed_at": "2017-08-25T14:46:54Z", "extension": {}, "forks_count": 0, "default_branch": "master" } }, "created_at": "2017-08-30T13:02:06Z", "actor": { "avatar_url": "https://avatars.githubusercontent.com/u/9035220?", "login": "Leitnin", "id": 9035220 }, "id": "6529780368", "repo": { "id": 2323906, "name": "repetier/Repetier-Firmware" }, "type": "ForkEvent", "public": true }

A few other pairs of fork events with the same forkee id, in the same hour of data: 101881864, 101881766, 101881916

bluecoco avatar Apr 06 '18 01:04 bluecoco
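For anyone who wants to look for similar pairs in their own extract, here is a rough sketch of the check described above. It assumes events are already loaded as Python dicts in the GH Archive JSON shape, and it treats "same actor id, same repo id, same forkee id" as the duplicate key. The helper names (`fork_key`, `find_duplicate_forks`) are made up for this illustration, and the sample data is a trimmed copy of the two events quoted above.

```python
from collections import defaultdict
from datetime import datetime


def parse_ts(ts):
    # Timestamps in the quoted events look like "2017-08-30T13:02:06Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def fork_key(event):
    # Assumed duplicate key: same actor, same source repo, same forkee id.
    return (
        event["actor"]["id"],
        event["repo"]["id"],
        event["payload"]["forkee"]["id"],
    )


def find_duplicate_forks(events):
    """Return {key: [events sorted by created_at]} for keys seen more than once."""
    groups = defaultdict(list)
    for event in events:
        if event.get("type") == "ForkEvent":
            groups[fork_key(event)].append(event)
    return {
        key: sorted(group, key=lambda e: parse_ts(e["created_at"]))
        for key, group in groups.items()
        if len(group) > 1
    }


# Trimmed copies of the two events quoted above (most fields omitted).
events = [
    {"id": "6529932288", "type": "ForkEvent",
     "created_at": "2017-08-30T13:33:43Z",
     "actor": {"id": 9035220}, "repo": {"id": 2323906},
     "payload": {"forkee": {"id": 101881905, "language": "HTML"}}},
    {"id": "6529780368", "type": "ForkEvent",
     "created_at": "2017-08-30T13:02:06Z",
     "actor": {"id": 9035220}, "repo": {"id": 2323906},
     "payload": {"forkee": {"id": 101881905, "language": None}}},
]

for key, group in find_duplicate_forks(events).items():
    gap = parse_ts(group[-1]["created_at"]) - parse_ts(group[0]["created_at"])
    print(key, [e["id"] for e in group], gap)
```

For the quoted pair, the gap between the two created_at values is 31 minutes 37 seconds, which matches the "about half an hour apart" observation.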

This is great, @bluecoco! Investigating...

annafil avatar Apr 06 '18 20:04 annafil

@bluecoco From what I've been able to learn so far, it seems like this is most likely a duplicate. I'm seeing some minor differences in parts of the payload (e.g. a field is filled in in the later version, but missing/not specified in the earlier version), but I'm still tracking down the root of the issue. I think your best bet is to take the latest iteration of this event for counting purposes in the interim.

I could see a potential situation where someone creates multiple forks from a repo when contributing to an OSS project because this may be preferable to attempting to catch their fork up to the latest changes upstream. However, in that case I would expect the delta in event timestamps to be a lot longer than 1 hour :)

annafil avatar Apr 16 '18 20:04 annafil
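The interim advice (keep the latest iteration of each event) could be sketched roughly as below. This is only an illustration under assumptions: events are dicts in the GH Archive JSON shape, duplicates are defined as sharing actor id, repo id, and forkee id, and `dedupe_fork_events` is a hypothetical helper name, not part of any GH Archive tooling. The sample data is a trimmed copy of the pair quoted earlier in the thread.

```python
def dedupe_fork_events(events):
    """Keep one ForkEvent per (actor, repo, forkee) key: the latest by created_at.

    Non-fork events are passed through unchanged.
    """
    latest = {}
    others = []
    for event in events:
        if event.get("type") != "ForkEvent":
            others.append(event)
            continue
        # Assumed duplicate key: same actor, same source repo, same forkee id.
        key = (
            event["actor"]["id"],
            event["repo"]["id"],
            event["payload"]["forkee"]["id"],
        )
        kept = latest.get(key)
        # Timestamps share one fixed UTC format ("YYYY-MM-DDTHH:MM:SSZ"),
        # so plain string comparison orders them chronologically.
        if kept is None or event["created_at"] > kept["created_at"]:
            latest[key] = event
    return others + list(latest.values())


# Trimmed copies of the two events quoted earlier (most fields omitted).
events = [
    {"id": "6529780368", "type": "ForkEvent",
     "created_at": "2017-08-30T13:02:06Z",
     "actor": {"id": 9035220}, "repo": {"id": 2323906},
     "payload": {"forkee": {"id": 101881905, "language": None}}},
    {"id": "6529932288", "type": "ForkEvent",
     "created_at": "2017-08-30T13:33:43Z",
     "actor": {"id": 9035220}, "repo": {"id": 2323906},
     "payload": {"forkee": {"id": 101881905, "language": "HTML"}}},
]

deduped = dedupe_fork_events(events)
print([e["id"] for e in deduped])
```

On the quoted pair this keeps the later event (id 6529932288), which is also the one with the language field filled in. A genuine delete-and-re-fork would get a different forkee id, per the test described earlier in the thread, so it would not be collapsed by this key.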

Related: is it safe to assume that the created_at timestamp for a forkee is the same as, or at least close to, the created_at timestamp of the corresponding ForkEvent?

bluecoco avatar Apr 20 '18 21:04 bluecoco

I think it depends on what data point you're looking at. The created_at timestamp for a ForkEvent should be fairly close to the actual time the event occurred. How accurate the created_at timestamp for a forkee is, depends on the data source you're looking at :) I believe the state of the repo itself (or a fork of it) is not something that GHArchive records, so if you can share more about the data you're working with, I might be able to give more information there.

annafil avatar Apr 20 '18 21:04 annafil
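One way to check this empirically within GH Archive data alone is to compare the two timestamps that each ForkEvent already carries: the event's own created_at and the created_at inside payload.forkee. The sketch below assumes events are dicts in the GH Archive JSON shape; `fork_event_lag_seconds` is a hypothetical helper name, and the sample data is a trimmed copy of the pair quoted earlier in the thread.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def fork_event_lag_seconds(event):
    """Seconds between the forkee's created_at and the ForkEvent's created_at."""
    event_time = datetime.strptime(event["created_at"], TS_FORMAT)
    forkee_time = datetime.strptime(
        event["payload"]["forkee"]["created_at"], TS_FORMAT
    )
    return (event_time - forkee_time).total_seconds()


# Trimmed copies of the two events quoted earlier (most fields omitted).
events = [
    {"id": "6529780368", "type": "ForkEvent",
     "created_at": "2017-08-30T13:02:06Z",
     "payload": {"forkee": {"id": 101881905,
                            "created_at": "2017-08-30T13:02:06Z"}}},
    {"id": "6529932288", "type": "ForkEvent",
     "created_at": "2017-08-30T13:33:43Z",
     "payload": {"forkee": {"id": 101881905,
                            "created_at": "2017-08-30T13:02:06Z"}}},
]

for event in events:
    print(event["id"], fork_event_lag_seconds(event))
```

For the quoted pair, the earlier event has zero lag and the later one lags the forkee's created_at by 1897 seconds. That is a single data point from this thread, not a general property of the dataset.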

We were using both GitHub Archive and GHTorrent data for this. It could be that some of the repo creation data is estimated by GHTorrent. And when I try to look up these repos' info, they are no longer available, presumably deleted or turned private. I will provide more examples once I dig a little more into this. Thanks.

bluecoco avatar Apr 23 '18 15:04 bluecoco