wikiextractor
UnicodeDecodeError
I get the following error on a Linux system when parsing a wikidump file with utf8-bin encoding. Any suggestions?
Traceback (most recent call last):
File "/usr/local/bin/WikiExtractor.py", line 4, in <module>
__import__('pkg_resources').run_script('wikiextractor==2.69', 'WikiExtractor.py')
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1501, in run_script
exec(script_code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/wikiextractor-2.69-py2.7.egg/EGG-INFO/scripts/WikiExtractor.py", line 3238, in <module>
File "/usr/local/lib/python2.7/dist-packages/wikiextractor-2.69-py2.7.egg/EGG-INFO/scripts/WikiExtractor.py", line 3228, in main
File "/usr/local/lib/python2.7/dist-packages/wikiextractor-2.69-py2.7.egg/EGG-INFO/scripts/WikiExtractor.py", line 2849, in process_dump
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb6 in position 23: invalid start byte
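One way to narrow this down is to find out which lines of the input actually fail to decode. Byte 0xb6 falls in the range 0x80-0xBF, which UTF-8 only allows as a continuation byte, so the decoder rejects it when it appears where a character should start. Below is a minimal Python 3 sketch, not part of wikiextractor; the helper name `find_undecodable_lines` and its `limit` parameter are made up for illustration. It assumes the input is available as a binary stream.

```python
import io


def find_undecodable_lines(stream, encoding="utf-8", limit=5):
    """Return (line_number, byte_offset, offending_byte) for lines that fail to decode.

    Hypothetical diagnostic helper; `stream` must yield bytes lines.
    """
    problems = []
    for lineno, raw in enumerate(stream, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte in this line
            problems.append((lineno, err.start, raw[err.start:err.start + 1]))
            if len(problems) >= limit:
                break
    return problems


# Small in-memory example: the second line has 0xb6 at offset 23
sample = io.BytesIO(b"a clean line\n" + b"x" * 23 + b"\xb6 rest of line\n")
report = find_undecodable_lines(sample)
print(report)
```

For a compressed dump, the same function could be given the object returned by `bz2.open(path, "rb")` from the standard library instead of the `BytesIO` sample.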
Same error here:
Traceback (most recent call last):
File "WikiExtractor.py", line 3238, in <module>
main()
File "WikiExtractor.py", line 3228, in main
args.compress, args.processes)
File "WikiExtractor.py", line 2940, in process_dump
for page_data in pages_from(input):
File "WikiExtractor.py", line 2782, in pages_from
if not isinstance(line, text_type): line = line.decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
INFO: 1082465 Santa language
INFO: 1082468 History of Thailand (1932–1973)
INFO: 1082479 Banbury mutiny
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 104: invalid start byte
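The traceback above points at `line.decode('utf-8')` in `pages_from`, which decodes in strict mode and therefore aborts on the first bad byte. As a possible workaround while the root cause is unknown, the decode can be made tolerant. This is only a sketch in Python 3: `decode_line` is a hypothetical helper, not a function in WikiExtractor.py, and replacing bad bytes hides the problem rather than fixing it.

```python
def decode_line(line, encoding="utf-8"):
    """Decode a bytes line, substituting U+FFFD for undecodable bytes.

    Hypothetical helper for illustration; str input is returned unchanged.
    """
    if isinstance(line, bytes):
        # errors="replace" swaps each invalid byte for the replacement character
        return line.decode(encoding, errors="replace")
    return line


# 0xa2 is a UTF-8 continuation byte, so it is invalid as a start byte
damaged = b"Banbury \xa2 mutiny"
cleaned = decode_line(damaged)
print(cleaned)
```

Using `errors="ignore"` instead would drop the bad bytes silently. Either choice loses information, so it seems worth checking first whether the dump itself is damaged or was read with the wrong settings.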
Same. It's really awful.
python -m wikiextractor.WikiExtractor enwiki-20200101-pages-articles-multistream.xml.bz2
INFO: Preprocessing 'enwiki-20200101-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ytang/bertvenv/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 621, in