wikiextractor

UnicodeDecodeError

Open bcompositor opened this issue 6 years ago • 3 comments

I get the following error on a Linux system when parsing a wikidump file with utf8-bin encoding. Any suggestions?

Traceback (most recent call last):
  File "/usr/local/bin/WikiExtractor.py", line 4, in <module>
    __import__('pkg_resources').run_script('wikiextractor==2.69', 'WikiExtractor.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1501, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/wikiextractor-2.69-py2.7.egg/EGG-INFO/scripts/WikiExtractor.py", line 3238, in <module>
    
  File "/usr/local/lib/python2.7/dist-packages/wikiextractor-2.69-py2.7.egg/EGG-INFO/scripts/WikiExtractor.py", line 3228, in main
    
  File "/usr/local/lib/python2.7/dist-packages/wikiextractor-2.69-py2.7.egg/EGG-INFO/scripts/WikiExtractor.py", line 2849, in process_dump
    
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb6 in position 23: invalid start byte

bcompositor • Sep 26 '17 15:09

Same error here

Traceback (most recent call last):
  File "WikiExtractor.py", line 3238, in <module>
    main()
  File "WikiExtractor.py", line 3228, in main
    args.compress, args.processes)
  File "WikiExtractor.py", line 2940, in process_dump
    for page_data in pages_from(input):
  File "WikiExtractor.py", line 2782, in pages_from
    if not isinstance(line, text_type): line = line.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
INFO: 1082465	Santa language
INFO: 1082468	History of Thailand (1932–1973)
INFO: 1082479	Banbury mutiny
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 104: invalid start byte

arpit1997 • Nov 07 '17 10:11
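
The traceback above fails on `line.decode('utf-8')` because byte 0xa2 is not a valid UTF-8 start byte. One possible workaround, shown here only as a minimal sketch and not as the project's fix, is to decode with `errors="replace"` so that undecodable bytes become U+FFFD instead of raising. The helper name `decode_line` is hypothetical, and the sketch assumes Python 3, whereas the traceback comes from Python 2.7.

```python
def decode_line(line, encoding="utf-8"):
    """Return text for a line that may arrive as bytes or as str.

    Hypothetical helper: bytes that are not valid in `encoding`
    are replaced with U+FFFD rather than raising UnicodeDecodeError.
    """
    if isinstance(line, bytes):
        return line.decode(encoding, errors="replace")
    return line


# The stray 0xa2 byte is kept as a replacement character; valid text is untouched.
sample = b"caf\xc3\xa9 \xa2 cents"
print(decode_line(sample))
```

This hides the bad bytes rather than explaining them, so it is only reasonable if losing a few characters is acceptable. If the dump itself is damaged (for example a truncated download), the stray bytes may be a symptom worth checking first.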

Same. It's really awful.

love-maker • May 07 '18 06:05

python -m wikiextractor.WikiExtractor enwiki-20200101-pages-articles-multistream.xml.bz2
INFO: Preprocessing 'enwiki-20200101-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ytang/bertvenv/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 621, in <module>
    main()
  File "/home/ytang/bertvenv/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 616, in main
    process_dump(input_file, args.templates, output_path, file_size,
  File "/home/ytang/bertvenv/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 329, in process_dump
    templates = load_templates(input, template_file)
  File "/home/ytang/bertvenv/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 204, in load_templates
    for line in file:
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 80: invalid start byte

htang2012 • Mar 25 '21 22:03
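
None of the reports above say which part of the dump contains the offending byte. One way to narrow it down, sketched here under assumptions and not taken from WikiExtractor itself, is to read the `.bz2` file in binary mode with the standard `bz2` module and try to decode each line separately. The function name `first_bad_line` is hypothetical, and the sketch assumes the dump is a single valid bz2 file that is expected to contain UTF-8 text.

```python
import bz2


def first_bad_line(path):
    """Find the first line of a bz2-compressed file that is not valid UTF-8.

    Returns (line_number, byte_offset_within_line, raw_line),
    or None if every line decodes cleanly.
    """
    with bz2.open(path, "rb") as f:  # binary mode: no decoding happens on read
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the index of the first undecodable byte in `raw`
                return number, exc.start, raw
    return None


# Usage (the path is an example; substitute your own dump):
# print(first_bad_line("enwiki-20200101-pages-articles-multistream.xml.bz2"))
```

If this returns `None` for a file that still fails in the extractor, the file content is probably fine and the problem is more likely in how the input is being opened or decompressed. That is a guess to check, not a confirmed diagnosis.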