wikiextractor

OSError: Invalid data stream

Open aakardwivedi opened this issue 5 years ago • 3 comments

I am trying to extract this Wikipedia dump: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

I encountered the following error, after which I have to use a keyboard interrupt (Ctrl+C) to stop the script. It generally occurs somewhere between INFO: 420000 and INFO: 450000, which is why I doubt it is caused by a single corrupted file in the wiki dataset.

.
.
INFO: 443826	The Fellowship (FGFCMI)
INFO: 443834	Georges Ernest Boulanger
INFO: 443836	Earl of Dalhousie
INFO: 443838	Archibald Campbell, 1st Marquess of Argyll
INFO: 443842	Monroe Beardsley
INFO: 443849	Fellowship of Fundamental Bible Churches
INFO: 443851	USS Darter (SS-227)
INFO: 443852	Marbella
INFO: 443854	Rh disease
INFO: 443855	William K. Wimsatt
INFO: 443867	Martín Chambi
INFO: 443868	USS Shark (SS-314)
INFO: 443870	Bluecurve
INFO: 443871	Combustion (software)
INFO: 443873	Round the Twist
INFO: 443878	Primm, Nevada
Traceback (most recent call last):
  File "/home/aakar/Documents/venv/bin/WikiExtractor.py", line 4, in <module>
    __import__('pkg_resources').run_script('wikiextractor==2.69', 'WikiExtractor.py')
  File "/home/aakar/Documents/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/aakar/Documents/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(script_code, namespace, namespace)
  File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 3238, in <module>
  File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 3228, in main
  File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 2940, in process_dump
  File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 2781, in pages_from
  File "/usr/lib/python3.6/fileinput.py", line 250, in __next__
    line = self._readline()
  File "/usr/lib/python3.6/bz2.py", line 219, in readline
    return self._buffer.readline(size)
  File "/usr/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.6/_compression.py", line 103, in read
    data = self._decompressor.decompress(rawblock, size)
OSError: Invalid data stream

Any help would be appreciated.

Python 3.6.7, Ubuntu 18.04.2 LTS

aakardwivedi, Mar 07 '19 06:03

I have met exactly the same issue. Were you able to work around it?

todpole3, Sep 03 '19 04:09

I have the same problem. It actually started with a UnicodeError, so I added reload(sys) to the original file. Once I did so, I got past the problematic file (I think it was only one), and now I am getting this error (Python 3.7, Ubuntu 19.04). I am trying to clean the DE wiki, and I had no such issues with the EN, PL and ES wikis...

marzenakrp, Sep 07 '19 08:09

Did you check the md5sum of the dump file?
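
For anyone who wants to run that check from Python rather than the md5sum command, here is a minimal sketch. It is not part of wikiextractor; the helper names md5_of_file and bz2_is_readable are made up for illustration. The first computes the MD5 digest of the downloaded file so it can be compared with the checksum published alongside the dump, and the second reads the whole bz2 stream to see whether it decompresses cleanly, independently of the extractor.

```python
import bz2
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte dump is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bz2_is_readable(path, chunk_size=1 << 20):
    """Decompress the whole file, discarding the output.

    Returns False if the stream is damaged (OSError) or truncated (EOFError).
    """
    try:
        with bz2.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError):
        return False
    return True
```

If the digest does not match the published one, or bz2_is_readable returns False, the download itself is likely damaged and re-downloading the dump is probably the first thing to try. Note that fully decompressing a dump of this size can take a long time.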

mjeensung, Apr 07 '20 23:04