gharchive.org
Add gists to the archive
I'm working on a parallel project to archive gists and repositories, and discovered that it's currently impossible to grab a historical list of gists. A call to /gists/public returns only recent results. In addition, gist creation has been removed from the Events API.
Given that, would you be interested in running a parallel scraper which grabs the latest gist creation 'events' from /gists/public and stores the metadata?
I've done a lot of the work to fail at a completely different goal, but it should work for the kind of setup you already have running with little change: https://github.com/za3k/github-backup/blob/master/all_gists.rb You'll recognize the code, I think.
Hmm, didn't realize that was split into a different endpoint.. doh! To answer your question: yes, it would be great to track gists.
In terms of code, if the actual API is effectively the same, I'm wondering if we can just parameterize the URL and run multiple crawlers? I'm happy to set this up.. once we have the code in place.
Nope, the gist API returns a list of gists, not a list of gist events in the same format as the Events API. They're not the same API split into two endpoints (sorry if calling them gist 'events' made that unclear).
See: https://developer.github.com/v3/gists/#list-gists
It's just that they're impossible to get historically, and it would be nice to store them in a streaming fashion, similarly to what you're doing.
The calls to get data are essentially the same just because the Github APIs are fairly consistent, but the returned data is different and I don't think you'd want to bundle it together.
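To make the schema mismatch concrete, here is a rough sketch of the two record shapes. The field lists are abridged and assumed for illustration (the actual objects carry many more fields; see the linked API docs), and the id/login values are hypothetical:

```python
# Abridged, assumed shapes illustrating the difference: an Events API
# record wraps a payload under a `type`, while /gists/public returns
# plain gist objects. Field lists here are illustrative, not exhaustive.

event_record = {             # shape of one Events API record
    "type": "GistEvent",     # the event type no longer emitted by GitHub
    "actor": {"login": "za3k"},          # hypothetical user
    "created_at": "2015-01-01T00:00:00Z",
    "payload": {"action": "create"},
}

gist_record = {              # shape of one /gists/public record
    "id": "abc123",          # hypothetical gist id
    "owner": {"login": "za3k"},
    "created_at": "2015-01-01T00:00:00Z",
    "files": {"example.rb": {"language": "Ruby"}},
}
```

Because the two shapes share almost nothing beyond timestamps, storing them in one stream would mean a lossy lowest-common-denominator schema.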
Grr, it's frustrating that they're now under a separate stream and schema.. This means we can't merge them with other archives and need separate storage + BigQuery pipelines.
@briandoll any idea what motivated this change? Is there any chance of allowing these back into the timeline API?
Okay, I'm going to claim this is relatively urgent due to news.
I've been talking to GitHub's support team, who have been super helpful. The person I'm working with is investigating the possibility of a one-time dump to work around the API limitations. It would be one-time though, so if we could get a streaming archive up and running before then (and again, no guarantee at all that it will even happen, but just in case), we could avoid any gap periods.
I can't host a streaming archive like that myself because I don't have a machine with 100% uptime or likely even the needed amount of storage, despite it being small.
I think the technical work to support this is relatively small; the outgoing call should just be "get the latest gists since date XXX" together with depagination.
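A minimal sketch of that outgoing call, assuming the documented "list public gists" endpoint (GET /gists/public) with its `since` timestamp filter and standard `page`/`per_page` pagination. This only builds the paged URLs; the real crawler would fetch each one, persist the JSON, and stop when a page comes back empty:

```python
# Sketch of the crawler's outgoing call: "get the latest gists since
# date X", depaginated. Assumes GitHub's `since` + page/per_page
# parameters on /gists/public; URLs built here are illustrative.

def gist_page_urls(since, max_pages=3, per_page=100):
    """Build the paged URLs covering gists updated after `since`."""
    base = "https://api.github.com/gists/public"
    return [
        f"{base}?since={since}&per_page={per_page}&page={page}"
        for page in range(1, max_pages + 1)
    ]

urls = gist_page_urls("2015-01-01T00:00:00Z", max_pages=2)
```

In practice the crawler would re-run this on an interval, setting `since` to the newest `created_at` it has already stored, which is essentially the same loop the existing event crawler runs.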
> @briandoll any idea what motivated this change?
I'm not sure, but it looks like Zach is connected to the right folks within GitHub Support so hopefully we'll be able to sort out a workable solution :+1:
> I can't host a streaming archive like that myself because I don't have a machine with 100% uptime or likely even the needed amount of storage, despite it being small.
We can run it on the same infrastructure as the current crawler, that's not an issue. The real complication is that we now have two sets of gzip archives, and I'd have to duplicate the BigQuery pipeline.. which is painful, both on this end and for anyone interested in analyzing this data.
That said, at a minimum we could start the crawler and start gathering data.. how it's exposed may be a whole separate thread.
Yeah, I think a BigQuery pipeline is something we can do after for sure. Do you need anything from me to start gathering gists?
Any news on this?
@za3k no progress on this end.. I'm stretched thin with other projects.
Any news now? :)
Sadly, no. I'll mark this as out of scope for this project.
Is this something I could help with?