gharchive.org
Add gists to the archive
I'm working on a parallel project to archive gists and repositories, and discovered that it's currently impossible to grab a historical list of gists. A call to /gists/public returns only recent results. In addition, gist creation has been removed from the Events API.
Given that, would you be interested in running a parallel scraper which grabs the latest gist creation 'events' from /gists/public and stores the metadata?
I've done a lot of the work to fail at a completely different goal, but it should work for the kind of setup you already have running with little change: https://github.com/za3k/github-backup/blob/master/all_gists.rb You'll recognize the code, I think.
Hmm, didn't realize that was split into a different endpoint.. doh! To answer your question: yes, it would be great to track gists.
In terms of code, if the actual API is effectively the same, I'm wondering if we can just parameterize the URL and run multiple crawlers? I'm happy to set this up.. once we have the code in place.
Nope, the gist API returns a list of gists, not a list of gist events in the same format as the Events API. They're not the same API split into two endpoints (sorry if calling them gist 'events' made that unclear).
See: https://developer.github.com/v3/gists/#list-gists
It's just that they're impossible to get historically, and it would be nice to store them in a streaming fashion, similarly to what you're doing.
The calls to get data are essentially the same just because the Github APIs are fairly consistent, but the returned data is different and I don't think you'd want to bundle it together.
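To make the schema mismatch concrete, here is a rough sketch of the two record shapes. The field lists are abridged and assumed for illustration (the actual objects carry many more fields; see the linked API docs), and the id/login values are hypothetical:

```python
# Abridged, assumed shapes illustrating the difference: an Events API
# record wraps a payload under a `type`, while /gists/public returns
# plain gist objects. Field lists here are illustrative, not exhaustive.

event_record = {             # shape of one Events API record
    "type": "GistEvent",     # the event type no longer emitted by GitHub
    "actor": {"login": "za3k"},          # hypothetical user
    "created_at": "2015-01-01T00:00:00Z",
    "payload": {"action": "create"},
}

gist_record = {              # shape of one /gists/public record
    "id": "abc123",          # hypothetical gist id
    "owner": {"login": "za3k"},
    "created_at": "2015-01-01T00:00:00Z",
    "files": {"example.rb": {"language": "Ruby"}},
}
```

Because the two shapes share almost nothing beyond timestamps, storing them in one stream would mean a lossy lowest-common-denominator schema.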
Grr, it's frustrating that they're now under a separate stream and schema.. This means we can't merge them with other archives and need separate storage + BigQuery pipelines.
@briandoll any idea what motivated this change? Is there any chance of allowing these back into the timeline API?
Okay, I'm going to claim this is relatively urgent due to news.
I've been talking to GitHub's support team, who have been super helpful. The person I'm working with is investigating the possibility of a one-time dump to work around the API limitations. It would be one-time though, so if we could get a streaming archive up and running before then (and again, no guarantee at all that it will even happen, but just in case), we could avoid any gap periods.
I can't host a streaming archive like that myself because I don't have a machine with 100% uptime or likely even the needed amount of storage, despite it being small.
I think the technical work to support this is relatively small; the outgoing call should just be "get the latest gists since date XXX" together with depagination.
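A minimal sketch of that outgoing call, assuming the documented "list public gists" endpoint (GET /gists/public) with its `since` timestamp filter and standard `page`/`per_page` pagination. This only builds the paged URLs; the real crawler would fetch each one, persist the JSON, and stop when a page comes back empty:

```python
# Sketch of the crawler's outgoing call: "get the latest gists since
# date X", depaginated. Assumes GitHub's `since` + page/per_page
# parameters on /gists/public; URLs built here are illustrative.

def gist_page_urls(since, max_pages=3, per_page=100):
    """Build the paged URLs covering gists updated after `since`."""
    base = "https://api.github.com/gists/public"
    return [
        f"{base}?since={since}&per_page={per_page}&page={page}"
        for page in range(1, max_pages + 1)
    ]

urls = gist_page_urls("2015-01-01T00:00:00Z", max_pages=2)
```

In practice the crawler would re-run this on an interval, setting `since` to the newest `created_at` it has already stored, which is essentially the same loop the existing event crawler runs.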
> @briandoll any idea what motivated this change?
I'm not sure, but it looks like Zach is connected to the right folks within GitHub Support so hopefully we'll be able to sort out a workable solution :+1:
> I can't host a streaming archive like that myself because I don't have a machine with 100% uptime or likely even the needed amount of storage, despite it being small.
We can run it on the same infrastructure as the current crawler, that's not an issue. The real complication is that we now have two sets of gzip archives, and I'd have to duplicate the BigQuery pipeline.. which is painful, both on this end and for anyone interested in analyzing this data.
That said, at a minimum we could start the crawler and start gathering data.. how it's exposed may be a whole separate thread.
Yeah, I think a BigQuery pipeline is something we can do after for sure. Do you need anything from me to start gathering gists?
Any news on this?
@za3k no progress on this end.. I'm stretched thin with other projects.
Any news now? :)
Sadly, no. I'll mark this as out of scope for this project.
Is this something I could help with?