wikidump
This Library
Hey this code looks perfect for a research project I'm working on. I downloaded the code through canopy and now I'm just trying to figure out how this code works. Do you have any documentation or file to start reading to understand better?
I don't have any plans to develop this project in the immediate future. That said, it is in a usable state, and I've used it myself fairly recently. I'm not familiar with Canopy, but if you install this like a normal Python package, it will install a command-line tool, `wikidump`; `wikidump -h` provides some details on how to use it.

When run, `wikidump` generates a config file, `wikidump.cfg`, in the directory it was run in. This config file contains two paths you will need to amend: `scratch`, where the indexes are stored, and `xml_dumps`, the path to a directory containing the XML dumps downloaded from Wikipedia. I've personally been using wp-download to download the dumps, so `xml_dumps` should be set to the path that wp-download saves them to. After downloading the relevant dumps, run `wikidump index`.
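For reference, the edited part of `wikidump.cfg` might end up looking something like this (the section header here is a guess on my part; keep whatever header the generated file actually uses, and substitute your own paths):

```ini
[paths]
scratch = /data/wikidump/scratch
xml_dumps = /data/wikidump/dumps
```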
Thereafter, you can use `wikidump dataset` to pull out a dataset. Each of the subcommands should have a bit of help text, for example `wikidump dataset -h`. Let me know if there is anything specific you need to figure out how to do, and I'll see whether it can be done under the current implementation.
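As an aside, a quick way to sanity-check the config before indexing is to read it back with Python's standard `configparser`. This is just a sketch, not part of wikidump itself — and since I don't remember the exact section name `wikidump` writes, it scans every section for the two options rather than assuming one:

```python
import configparser
import pathlib

def dump_paths(cfg_file):
    """Return the scratch and xml_dumps paths from a wikidump-style config.

    Scans every section so we don't have to assume the section name
    that wikidump actually uses when it writes the file.
    """
    cfg = configparser.ConfigParser()
    cfg.read(cfg_file)
    found = {}
    for section in cfg.sections():
        for key in ("scratch", "xml_dumps"):
            if cfg.has_option(section, key):
                found[key] = pathlib.Path(cfg.get(section, key))
    return found

# Example with a toy config file (so we don't touch a real wikidump.cfg):
pathlib.Path("example.cfg").write_text(
    "[paths]\nscratch = /tmp/scratch\nxml_dumps = /tmp/dumps\n"
)
paths = dump_paths("example.cfg")
print(paths["xml_dumps"])  # /tmp/dumps
```

From there you could, for instance, check that both directories exist before kicking off `wikidump index`.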