
Wikipedia dumps available?

Open • fhoffa opened this issue 6 years ago • 2 comments

Nice job!

You mentioned that you're using this tool to load Wikimedia dumps into BigQuery?

https://medium.com/@arjunmehta/felipe-hoffa-ive-built-a-cli-tool-to-take-any-mysql-dump-and-convert-to-newline-delimited-json-5b75895e9ae2

It would be really cool if you could share those tables!

fhoffa • Sep 16 '19 21:09

@fhoffa The Wikimedia Foundation provides public dumps of all Wikipedia language versions (read more here). A new dump run is attempted roughly once every two weeks, and as far as I know they generally complete successfully.

For example, here is the enwiki set for 2019-09-01.
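For reference, the individual files within a run follow a predictable URL pattern; a minimal fetch example (pattern inferred from the dump index pages, so double-check it against the actual listing):

```sh
# Fetch the `page` table dump from the 2019-09-01 enwiki run.
curl -LO https://dumps.wikimedia.org/enwiki/20190901/enwiki-20190901-page.sql.gz
```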

I would like to help get them into the fh-bigquery public dataset (or some other), but I'm not sure whether that would comply with Wikipedia's licensing for the dumps: Creative Commons Attribution-ShareAlike.

It seems like a fairly generous license, but it requires attribution, and I'm not sure how that could be done on BigQuery. Maybe putting it in the dataset/table description field would be sufficient?
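As a sketch of that idea, attribution text can be attached with the bq CLI's standard `--description` flag; the dataset name here is hypothetical:

```sh
# Hypothetical dataset name; `bq update --description` works on datasets and tables.
bq update --description \
  "Mirror of Wikimedia dumps (https://dumps.wikimedia.org), CC BY-SA" \
  wikipedia_mirror
```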

I think it would be fantastic to create a cron job that automatically updates the BigQuery tables, as a kind of "mirror" for these database dumps (rough sketch below).

You'll also notice that some of the dump files are XML. Those contain the article content, whereas the SQL dump files contain the metadata and link information. This module was designed for the latter.
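Putting those pieces together, here's a rough sketch of what one iteration of such a mirror job might look like, assuming sqldump-to reads the dump on stdin and writes newline-delimited JSON to an output directory (check this project's README for the real flags); the bucket and dataset names are placeholders:

```sh
#!/usr/bin/env bash
set -euo pipefail

DATE=20190901   # placeholder run date
TABLE=page      # one of the SQL-dumped tables

# 1. Download one SQL dump file (URL pattern as above).
curl -LO "https://dumps.wikimedia.org/enwiki/${DATE}/enwiki-${DATE}-${TABLE}.sql.gz"

# 2. Convert to newline-delimited JSON (sqldump-to flags assumed; see README).
mkdir -p ./out
zcat "enwiki-${DATE}-${TABLE}.sql.gz" | sqldump-to -d ./out

# 3. Stage in GCS and load into BigQuery (hypothetical bucket/dataset names).
gsutil cp ./out/*.json "gs://my-staging-bucket/enwiki/${DATE}/"
bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect \
  "wikipedia_mirror.enwiki_${TABLE}" \
  "gs://my-staging-bucket/enwiki/${DATE}/*.json"
```

A crontab entry like `0 3 1,15 * * /path/to/mirror.sh` would roughly match the two-week dump cadence, though something like Cloud Scheduler would be a more robust home for it.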

arjunmehta • Sep 17 '19 22:09

I would be more than happy to create a script/tool to load the massive article-content data itself (from the XML files) into BigQuery tables as well, but I don't have the resources to handle that much data and the associated cost.

arjunmehta • Sep 17 '19 23:09