wikiextractor
OSError: Invalid data stream
I am trying to extract this Wikipedia dump: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
I encountered the following error, after which I have to press Ctrl+C (KeyboardInterrupt) to stop the script. It generally occurs somewhere between INFO: 420000 and INFO: 450000, which is why I doubt it is caused by a single corrupted file in the wiki dataset.
.
.
INFO: 443826 The Fellowship (FGFCMI)
INFO: 443834 Georges Ernest Boulanger
INFO: 443836 Earl of Dalhousie
INFO: 443838 Archibald Campbell, 1st Marquess of Argyll
INFO: 443842 Monroe Beardsley
INFO: 443849 Fellowship of Fundamental Bible Churches
INFO: 443851 USS Darter (SS-227)
INFO: 443852 Marbella
INFO: 443854 Rh disease
INFO: 443855 William K. Wimsatt
INFO: 443867 Martín Chambi
INFO: 443868 USS Shark (SS-314)
INFO: 443870 Bluecurve
INFO: 443871 Combustion (software)
INFO: 443873 Round the Twist
INFO: 443878 Primm, Nevada
Traceback (most recent call last):
File "/home/aakar/Documents/venv/bin/WikiExtractor.py", line 4, in <module>
__import__('pkg_resources').run_script('wikiextractor==2.69', 'WikiExtractor.py')
File "/home/aakar/Documents/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/aakar/Documents/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1453, in run_script
exec(script_code, namespace, namespace)
File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 3238, in <module>
File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 3228, in main
File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 2940, in process_dump
File "/home/aakar/Documents/venv/lib/python3.6/site-packages/wikiextractor-2.69-py3.6.egg/EGG-INFO/scripts/WikiExtractor.py", line 2781, in pages_from
File "/usr/lib/python3.6/fileinput.py", line 250, in __next__
line = self._readline()
File "/usr/lib/python3.6/bz2.py", line 219, in readline
return self._buffer.readline(size)
File "/usr/lib/python3.6/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.6/_compression.py", line 103, in read
data = self._decompressor.decompress(rawblock, size)
OSError: Invalid data stream
Any help would be appreciated.
Environment: Python 3.6.7, Ubuntu 18.04.2 LTS
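The traceback ends inside Python's bz2 decompressor rather than in WikiExtractor's own parsing code, so one way to narrow this down is to read the archive from start to finish with the standard library alone. The following is a minimal sketch, not part of WikiExtractor; the function name `check_bz2_stream` and the chunk size are illustrative assumptions. If it also fails, the downloaded file itself is probably damaged or truncated.

```python
import bz2


def check_bz2_stream(path, chunk_size=1 << 20):
    """Decompress a .bz2 file end to end without keeping the data.

    Returns (ok, decompressed_bytes, error_message). The helper name and
    the 1 MiB chunk size are assumptions made for this sketch.
    """
    total = 0
    try:
        with bz2.open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break  # clean end of stream
                total += len(chunk)
    except (OSError, EOFError) as exc:
        # OSError: corrupt data ("Invalid data stream")
        # EOFError: the file ends before the end-of-stream marker
        return False, total, str(exc)
    return True, total, ""


# Example call (the path is a placeholder for your local download):
# ok, n_bytes, err = check_bz2_stream("enwiki-latest-pages-articles.xml.bz2")
# print(ok, n_bytes, err)
```

On a full dump this takes a while, since the whole archive is decompressed. The returned byte count shows roughly how far into the decompressed stream the reader got before failing.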
I have met exactly the same issue. Were you able to work around it?
I have the same problem. It actually started with a UnicodeError, so I added reload(sys) to the original file. After that I got past the problematic file (I think there was only one), and now I am getting this error instead (Python 3.7, Ubuntu 19.04). I am trying to clean the DE wiki; I had no such issues with the EN, PL and ES wikis.
Did you check the md5sum of the dump file?
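If `md5sum` is not at hand, the same check can be done in Python. This is a minimal sketch; the helper name `md5_of_file` is an assumption for illustration. The resulting hex digest should be compared against the checksum published for that dump (assuming one is published alongside the file you downloaded); a mismatch means the download is incomplete or corrupted.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte dumps.
    The helper name and chunk size are assumptions made for this sketch.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (the path is a placeholder for your local download):
# print(md5_of_file("enwiki-latest-pages-articles.xml.bz2"))
```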