wikiextractor Added '--no-doc' and '--no-title' options to WikiExtractor.py

Open josemazo opened this issue 8 years ago • 4 comments

These new options produce completely clean text. This is sometimes useful, for example, when you simply need a lot of text in a language to build a text corrector or recognizer.

Mar 31 '16 14:03 josemazo

This is extremely useful, thank you!

Jun 07 '16 19:06 zachmayer

This needs to be rebased off master =/

Jul 06 '16 19:07 zachmayer

Could someone please fix the current conflicts so that this PR can get merged?

Jun 19 '18 13:06 PanderMusubi

You can do it with a simple command:

cat extracted/wiki_en | sed "/^\s*$/d" | grep -v "^<doc id=" | grep -v "^</doc>$" > wiki.txt

Nov 07 '19 07:11 mustfkeskin
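For anyone who would rather do the same cleanup in Python, here is a minimal sketch of the filtering the command above performs, plus optional title removal. It is not the code from this PR. It assumes the extractor output wraps each article in `<doc id=...>` and `</doc>` lines and puts the title on the first non-blank line after the opening tag; `clean_lines` and `drop_title` are hypothetical names chosen for illustration.

```python
import re

# Assumed output layout (an assumption, not confirmed in this thread):
#   <doc id="..." url="..." title="...">
#   Title
#
#   Body text...
#   </doc>
DOC_OPEN = re.compile(r'^<doc id=')
DOC_CLOSE = re.compile(r'^</doc>\s*$')


def clean_lines(lines, drop_title=True):
    """Yield body lines without <doc> wrappers, blank lines and, optionally, titles."""
    expect_title = False
    for line in lines:
        if DOC_OPEN.match(line):
            # The next non-blank line is assumed to be the article title.
            expect_title = drop_title
            continue
        if DOC_CLOSE.match(line):
            expect_title = False
            continue
        if not line.strip():
            continue  # skip blank lines, like the sed step
        if expect_title:
            expect_title = False
            continue  # skip the title line
        yield line.rstrip('\n')
```

To write the result to a file, iterate over an opened extractor output file (for example `extracted/wiki_en` from the command above) and write each yielded line followed by a newline. Passing `drop_title=False` keeps the titles and only strips the `<doc>` wrappers and blank lines.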
