Use event_id as ID where possible

Open hut8 opened this issue 8 years ago • 3 comments

This just occurred to me. The pre-2015 events (in the timeline directory) don't have event_id attributes, but the newer ones all do. Maybe I could use event_id as the MongoDB _id for the post-2015 events instead of keeping it in a separate indexed field. Dropping that extra index would likely give a big boost to insert performance, which we really need. Right now there are 4 indexes on that collection, and not being able to fit them all in memory is what really slows things to a crawl.
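
A rough sketch of that idea in plain Python, with no database driver involved: reuse event_id as the document's _id when it is present, and leave _id unset otherwise so that MongoDB generates an ObjectId on insert. The function name `with_event_id_as_key`, the field name `event_id` (later comments in this thread call it `_event_id`), and the sample events are all assumptions for illustration, not taken from the project's code.

```python
from typing import Any, Dict


def with_event_id_as_key(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `event` that uses event_id as _id when available."""
    doc = dict(event)  # shallow copy, so the caller's dict is left untouched
    event_id = doc.pop("event_id", None)
    if event_id is not None:
        # Post-2015 event: the primary key doubles as the event ID,
        # so no separate index on event_id is needed for these documents.
        doc["_id"] = event_id
    # Pre-2015 (timeline) event: no _id is set here; MongoDB assigns an
    # ObjectId when the document is inserted.
    return doc


# Hypothetical sample events, for illustration only.
new_event = {"event_id": 2489651045, "type": "PushEvent"}
old_event = {"type": "PushEvent"}

print(with_event_id_as_key(new_event))
print(with_event_id_as_key(old_event))
```

One consequence of this design is that _id values would be of mixed types (integers for newer events, ObjectIds for older ones), which is the kind of inconsistency an API consumer would notice.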

Thoughts, @joshjordan ?

hut8 avatar Feb 01 '16 14:02 hut8

I think that is definitely worthwhile. I didn't realize Mongo was trying to keep 4 indexes in memory. Is it also possible to specify which indexes should be on disk vs in memory?

joshjordan avatar Feb 01 '16 15:02 joshjordan

I just came across a few event objects missing an _event_id attribute and was wondering what was going on. Regardless of how you decide to handle this in mongo on the back-end, as an API consumer of these events, it would be confusing to expect an integer _event_id and instead get a string representation of the _id attribute.

s2t2 avatar Mar 17 '16 22:03 s2t2

The _event_id attribute is only present in events that came from the "Event API", which covers events from January 1, 2015 on. Prior to that, the GitHub Archive was using the Timeline API, which didn't have an "Event ID" per se. The main reason I'm using an index on the _event_id field (or dealing with that field at all) is to work around the fact that you can't atomically load thousands of documents in MongoDB. If a load only partially completes and has to be re-run, a unique index on that field guarantees duplicates aren't inserted. I should probably document that better :smile:
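
A minimal sketch of the unique-index approach described above, assuming the pymongo driver. The helper names (`ensure_event_id_index`, `non_duplicate_errors`, `load_batch`) are hypothetical and not from the project's code. The index is created as sparse on the assumption that Timeline-era documents have no _event_id at all; without that, a unique index would treat every missing value as the same null key.

```python
from typing import Any, Dict, List

DUPLICATE_KEY = 11000  # MongoDB's error code for a unique-index violation


def ensure_event_id_index(collection) -> None:
    """Create a unique index on _event_id, skipping docs that lack the field."""
    collection.create_index([("_event_id", 1)], unique=True, sparse=True)


def non_duplicate_errors(write_errors: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the write errors that are NOT duplicate-key rejections."""
    return [err for err in write_errors if err.get("code") != DUPLICATE_KEY]


def load_batch(collection, docs: List[Dict[str, Any]]) -> int:
    """Insert docs, tolerating already-loaded ones; return how many were new."""
    # Imported lazily so the helpers above work without pymongo installed.
    from pymongo.errors import BulkWriteError

    if not docs:
        return 0  # insert_many rejects an empty list
    try:
        # ordered=False keeps going after a duplicate instead of stopping
        # at the first rejected document.
        result = collection.insert_many(docs, ordered=False)
        return len(result.inserted_ids)
    except BulkWriteError as exc:
        errors = exc.details.get("writeErrors", [])
        if non_duplicate_errors(errors):
            raise  # something other than a duplicate went wrong
        return exc.details.get("nInserted", 0)
```

With this in place, re-running a partially completed load is safe: documents that were already inserted are rejected by the index, and only the missing ones are added.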

hut8 avatar Mar 18 '16 01:03 hut8