Extract corpora from Wikipedia dumps

Wiki-Dump Reader

Extract corpora from wiki-dump.

Install

pip install wiki-dump-reader

Usage

Download the dump file *wiki-*-pages-articles.xml first. You can then iterate over its pages and get cleaned text from each page's raw text:

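from wiki_dump_reader import Cleaner, iterate

cleaner = Cleaner()
# iterate() yields a (title, text) pair for each page in the dump
for title, text in iterate('*wiki-*-pages-articles.xml'):
    # clean the raw page text first
    text = cleaner.clean_text(text)
    # then get the final text together with its links
    cleaned_text, links = cleaner.build_links(text)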

Just ignore links if you don't need them:

cleaned_text, _ = cleaner.build_links(text)
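
To keep the result instead of discarding it on each iteration, the cleaned pages can be written out as they are produced. The following is a minimal sketch, not part of wiki-dump-reader: `write_corpus`, its `limit` parameter, and the JSON Lines output format are hypothetical choices made for illustration. It takes any iterable of (title, text) pairs and any cleaning function, so it does not depend on the library's internals.

```python
import io
import json
from typing import Callable, Iterable, TextIO, Tuple


def write_corpus(
    pages: Iterable[Tuple[str, str]],
    clean: Callable[[str], str],
    out: TextIO,
    limit: int = 0,
) -> int:
    """Write one JSON object per page to `out` and return the number written.

    `pages` yields (title, raw_text) pairs, like the loop shown above.
    `clean` maps raw text to cleaned text.
    `limit` (hypothetical helper option) stops early when greater than 0,
    which is handy for a trial run on a large dump.
    """
    written = 0
    for title, text in pages:
        if limit and written >= limit:
            break
        cleaned = clean(text)
        if not cleaned.strip():
            continue  # skip pages that are empty after cleaning
        record = {"title": title, "text": cleaned}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written
```

With the library, `pages` would be the result of `iterate(...)` on the dump file, and `clean` would be a small function that calls `cleaner.clean_text` followed by `cleaner.build_links` and returns only the text.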

See the examples to get a feel for the output.