Wikiextractor not extracting

Open lalitkumarj opened this issue 7 years ago • 7 comments

Hi All

Does WikiExtractor work directly on the bz2 file? I installed WikiExtractor using python setup.py.

Here is the command I then used: wikiextractor/WikiExtractor.py -o extracted enwiki-20170301-pages-articles-multistream.xml.bz2

Here is the output:

INFO: Loaded 0 templates in 0.0s
INFO: Starting page extraction from enwiki-20170301-pages-articles-multistream.xml.bz2.
INFO: Using 7 extract processes.
INFO: Finished 7-process extraction of 0 articles in 0.0s (0.0 art/s)

What am I doing wrong? Thanks!!

lalitkumarj avatar Mar 18 '17 15:03 lalitkumarj

So I was able to use: bzcat enwiki-20170301-pages-articles-multistream.xml.bz2 | wikiextractor/WikiExtractor.py -o extracted

Not sure why it needs to work on an extracted version.

lalitkumarj avatar Mar 18 '17 16:03 lalitkumarj

"multistream" apparently means it was compressed in a different way (see here), so maybe WikiExtractor doesn't know how to handle that. It works on non-multistream bz2 dump files.

BrenBarn avatar Mar 18 '17 20:03 BrenBarn

https://github.com/attardi/wikiextractor/issues/61

markdimi avatar Mar 21 '17 14:03 markdimi

I was able to run it on Cygwin with the command: bzcat enwiki-latest-pages-articles-multistream.xml.bz2 | WikiExtractor.py -o output -s --lists --filter_category categories.txt -

astha-chem avatar Feb 24 '18 18:02 astha-chem

So I was able to use: bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted

Not sure why it needs to work on an extracted version.

Shouldn't it be: bzcat enwiki-20170301-pages-articles-multistream.xml.bz2 | wikiextractor/WikiExtractor.py -o extracted - (with a trailing "-" as the input argument)?

zhixiaochuan12 avatar Mar 25 '19 14:03 zhixiaochuan12

It should be "bzcat enwiki-20170301-pages-articles-multistream.xml.bz2 | python wikiextractor/WikiExtractor.py -o extracted -"

bruce803 avatar Dec 31 '19 00:12 bruce803

WikiExtractor.py reads its input through fileinput. When fileinput.hook_compressed is used on a *.bz2 file, fileinput in turn relies on bz2. But bz2.BZ2File in Python 2 "does not support input files containing multiple streams", as the documentation says: https://docs.python.org/2.7/library/bz2.html
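A minimal sketch of that read path on Python 3, where the standard bz2 module does handle multi-stream files (supported since 3.3). The file name, the sample XML fragments, and the helper names `write_multistream` and `read_lines` are made up for illustration; this is not WikiExtractor's own code.

```python
import bz2
import fileinput
import os
import tempfile


def write_multistream(path, chunks):
    """Write each chunk as its own bz2 stream, back to back, in one file."""
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(bz2.compress(chunk))


def read_lines(path):
    """Read a .bz2 file through fileinput with the hook_compressed open hook."""
    # mode="rb" keeps every line as bytes regardless of the Python 3 version.
    with fileinput.input(files=[path], mode="rb",
                         openhook=fileinput.hook_compressed) as stream:
        return [line for line in stream]


# Hypothetical stand-in for a multistream dump: three independent streams.
sample_chunks = [b"<siteinfo/>\n", b"<page>1</page>\n", b"<page>2</page>\n"]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.xml.bz2")
write_multistream(sample_path, sample_chunks)

lines = read_lines(sample_path)
print(len(lines))
```

On Python 3 the lines from all three streams come back, which is consistent with the first workaround in the list that follows.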

Possible workarounds would be:

  1. Use Python 3 instead (its bz2 module has supported multi-stream files since 3.3)

  2. Don't use multistream data (like #61 above)

  3. Use decompressed data (like bzcat/stdin method above)

  4. Import bz2file as if it were bz2 before importing fileinput, for example:

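    import sys

    # On Python 2, substitute the bz2file backport (a separate install),
    # which can read multi-stream files, for the standard bz2 module.
    PY2 = sys.version_info[0] == 2
    if PY2:
        import bz2file as bz2
        sys.modules['bz2'] = bz2
    else:
        import bz2

    # Import fileinput only after the swap, as this workaround requires.
    import fileinput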

elbakramer avatar Feb 11 '20 10:02 elbakramer
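For anyone wondering what "multiple streams" looks like in practice, here is a small Python 3 sketch. It builds a two-stream blob in memory and compares a single `bz2.BZ2Decompressor`, which stops at the end of the first stream, with `bz2.decompress`, which reads all of them. The sample data and the helper names `first_stream_only` and `all_streams` are made up for illustration. Treating the single-decompressor case as a stand-in for the Python 2 behaviour is an assumption, not something verified against WikiExtractor.

```python
import bz2


def first_stream_only(data):
    """Decompress with one BZ2Decompressor, which ends at the first stream."""
    decompressor = bz2.BZ2Decompressor()
    output = decompressor.decompress(data)
    # Bytes found after the end of the first stream are left untouched here.
    return output, decompressor.unused_data


def all_streams(data):
    """Decompress every concatenated stream (supported since Python 3.3)."""
    return bz2.decompress(data)


# Hypothetical stand-in for a multistream dump: a header stream, then pages.
header = b"<siteinfo/>\n"
pages = b"<page>1</page>\n<page>2</page>\n"
blob = bz2.compress(header) + bz2.compress(pages)

first, leftover = first_stream_only(blob)
everything = all_streams(blob)

print(first)
print(len(leftover) > 0)
print(everything == header + pages)
```

If a reader only ever sees the first stream, and that stream holds no pages, a result like "extraction of 0 articles" would be a plausible outcome; piping through bzcat sidesteps this because the decompression happens outside Python.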