wikiextractor
Wikiextractor not extracting
Hi All
Does the wikiextractor work directly on the bz2 file? I used python setup.py to install the WikiExtractor.
Here is the command I then used:
wikiextractor/WikiExtractor.py -o extracted enwiki-20170301-pages-articles-multistream.xml.bz2
Here is the output:
INFO: Loaded 0 templates in 0.0s
INFO: Starting page extraction from enwiki-20170301-pages-articles-multistream.xml.bz2.
INFO: Using 7 extract processes.
INFO: Finished 7-process extraction of 0 articles in 0.0s (0.0 art/s)
What am I doing wrong? Thanks!!
So I was able to use:
bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted
Not sure why it needs to work on an extracted version.
"multistream" apparently means it was compressed in a different way (see here), so maybe WikiExtractor doesn't know how to handle that. It works on non-multistream bz2 dump files.
https://github.com/attardi/wikiextractor/issues/61
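To make "multistream" concrete: as far as I understand it, such a file is just several complete bz2 streams written one after another. Here is a small sketch, assuming Python 3.3 or later (where the bz2 module reads multi-stream input); the two sample payloads are made up for illustration and are not taken from a real dump.

```python
import bz2

# Two independent bz2 streams, concatenated the way a multistream file is laid out.
first = b"<page>one</page>\n"
second = b"<page>two</page>\n"
multistream = bz2.compress(first) + bz2.compress(second)

# On Python 3.3+, bz2.decompress walks through every stream in the input.
restored = bz2.decompress(multistream)
print(restored.decode("utf-8"))
```

On Python 3 the restored text contains both pages, which fits the reports that the same file works once it has been fully decompressed.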
I was able to run it on Cygwin with this command:
bzcat enwiki-latest-pages-articles-multistream.xml.bz2| WikiExtractor.py -o output -s --lists --filter_category categories.txt -
So I was able to use:
bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted
Not sure why it needs to work on an extracted version.
Shouldn't it be the following, with a trailing "-"?
bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted -
It should be "bzcat enwiki-20170301-pages-articles-multistream.xml.bz2|python wikiextractor/WikiExtractor.py -o extracted -"
WikiExtractor.py depends on fileinput. fileinput depends on bz2 when using fileinput.hook_compressed and reading a *.bz2 file.
But bz2.BZ2File in python2 "does not support input files containing multiple streams", as it says here: https://docs.python.org/2.7/library/bz2.html
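A rough illustration of what a reader without multi-stream support sees. This is only an analogy written with Python 3's bz2.BZ2Decompressor (which decodes exactly one stream), not the actual python2 BZ2File code, and the "header"/"articles" payloads are invented for the example.

```python
import bz2

# A tiny stand-in for a multistream file: two streams back to back.
multistream = bz2.compress(b"header\n") + bz2.compress(b"articles\n")

# A single BZ2Decompressor stops at the end of the first stream.
decoder = bz2.BZ2Decompressor()
seen = decoder.decompress(multistream)

print(seen)                      # only the first stream's content
print(decoder.eof)               # the decoder considers itself finished
print(len(decoder.unused_data))  # bytes of the second stream, left undecoded
```

Everything after the first stream ends up in unused_data, so a reader that stops there never reaches the rest of the file.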
Possible workarounds would be:
- Use python3 instead
- Don't use multistream data (like #61 above)
- Use decompressed data (like the bzcat/stdin method above)
- Import bz2file as if it were bz2 before importing fileinput, for example:

    import sys
    PY2 = sys.version_info[0] == 2
    if PY2:
        import bz2file as bz2
        sys.modules['bz2'] = bz2
    else:
        import bz2
    import fileinput
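For anyone curious what a multi-stream-aware reader has to do, here is a minimal sketch under some assumptions: Python 3, and a hypothetical helper name iter_decompressed that is not part of WikiExtractor or bz2file. It starts a fresh decompressor each time one stream ends and feeds it the leftover bytes.

```python
import bz2
import io


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed chunks from a bz2 file that may hold several streams.

    Hypothetical helper for illustration only.
    """
    decompressor = bz2.BZ2Decompressor()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            data = decompressor.decompress(chunk)
            if data:
                yield data
            if decompressor.eof:
                # This stream is done; leftover bytes belong to the next stream.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                # The whole chunk was consumed; go read more input.
                chunk = b""


# Usage with an in-memory stand-in for a multistream dump.
parts = [b"first stream\n", b"second stream\n", b"third stream\n"]
blob = b"".join(bz2.compress(p) for p in parts)
output = b"".join(iter_decompressed(io.BytesIO(blob), chunk_size=7))
print(output.decode("utf-8"))
```

The small chunk_size in the usage example is deliberate, so that stream boundaries fall in the middle of a read. In practice the bzcat pipe or python3 is simpler than maintaining something like this.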