wikiextractor
Added '--no-doc' and '--no-title' options to WikiExtractor.py
These new options produce completely clean text. This is useful, for example, when you simply need a large amount of text in a language to build a text corrector or recognizer.
This is extremely useful, thank you!
This needs to be rebased off master =/
Could someone please fix the current conflicts so that this PR can get merged?
You can do it with a simple command:
cat extracted/wiki_en | sed "/^\s*$/d" | grep -v "^<doc id=" | grep -v "</doc>$" > wiki.txt
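For anyone who prefers to do this from Python, the intent of the one-liner above (drop blank lines and the `<doc id=...>` / `</doc>` wrapper lines) can be sketched roughly as follows. This is only a minimal sketch, not part of WikiExtractor: the name `clean_lines` is hypothetical, and it assumes each article in the extracted output is wrapped in a `<doc id=...>` opening line and a line ending in `</doc>`.

```python
import re

# Assumed wrapper format: an opening line starting with `<doc id=`
# and a closing line ending with `</doc>`.
DOC_OPEN = re.compile(r"^<doc id=")
DOC_CLOSE = re.compile(r"</doc>\s*$")


def clean_lines(lines):
    """Yield lines with blank lines and <doc>/</doc> wrapper lines removed.

    `clean_lines` is a hypothetical helper written for this example.
    """
    for line in lines:
        # Skip empty or whitespace-only lines.
        if not line.strip():
            continue
        # Skip the wrapper lines around each article.
        if DOC_OPEN.match(line) or DOC_CLOSE.search(line):
            continue
        # Strip only the trailing newline; other content is left untouched.
        yield line.rstrip("\n")
```

To use it, open the extracted file, pass the file object to `clean_lines`, and write each yielded line to `wiki.txt` followed by a newline. Like the shell command, this keeps the title line that follows each opening tag.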