wiki-dump-reader
Extract corpora from Wikipedia dumps
Wiki-Dump Reader
Extract text corpora from Wikipedia dump files.
Install
pip install wiki-dump-reader
Usage
Download the dump file *wiki-*-pages-articles.xml first. You can then iterate over its pages and clean the text of each one:
from wiki_dump_reader import Cleaner, iterate
cleaner = Cleaner()
for title, text in iterate('*wiki-*-pages-articles.xml'):
    text = cleaner.clean_text(text)
    cleaned_text, links = cleaner.build_links(text)
If you don't need the links, just ignore them:
cleaned_text, _ = cleaner.build_links(text)
See the examples for a feel of the input and output.
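The usage above only produces cleaned text in memory. As a rough sketch of how it might be turned into a corpus file, the helper below writes one article per line as `title<TAB>text`. The names `write_corpus` and `dump_to_corpus` and the tab-separated layout are assumptions made for illustration and are not part of wiki-dump-reader; only `Cleaner`, `iterate`, `clean_text` and `build_links` come from the usage shown above.

```python
def write_corpus(pages, cleaner, out):
    """Write one cleaned article per line to the text stream `out`.

    `pages` is any iterable of (title, text) pairs, such as the one
    returned by iterate(); `cleaner` needs clean_text() and build_links().
    Returns the number of articles written.
    """
    count = 0
    for title, text in pages:
        text = cleaner.clean_text(text)
        cleaned_text, _ = cleaner.build_links(text)  # links are not needed here
        # Collapse all whitespace so each article stays on a single line.
        line = " ".join(cleaned_text.split())
        if not line:
            continue  # skip pages that are empty after cleaning
        out.write(f"{title}\t{line}\n")
        count += 1
    return count


def dump_to_corpus(dump_path, corpus_path):
    """Hypothetical wrapper: clean a whole dump file into a corpus file."""
    # Imported here so write_corpus() can be used without the package installed.
    from wiki_dump_reader import Cleaner, iterate

    with open(corpus_path, "w", encoding="utf-8") as out:
        return write_corpus(iterate(dump_path), Cleaner(), out)
```

Calling `dump_to_corpus` with the path of a downloaded dump and an output path would write the corpus and return the number of articles kept. Keeping the writing logic separate from the package import makes it easy to try the helper on a handful of (title, text) pairs before running it over a full dump.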