wikiextractor
UnicodeDecodeError when extracting from a (decompressed) .xml dump
Hi,
I am getting UnicodeDecodeErrors when I try to extract a decompressed .xml dump. For the record, this is how I am using the WikiExtractor:
WikiExtractor.py wikicorpus_en.xml -b 100M --processes 50 -o /path/to/extraction/folder/
I suspect this line is the origin of the problem, since fileinput.hook_compressed seems to open the file with open(filename, mode) instead of open(filename, mode, encoding='utf-8'), which would avoid the decoding error.
Replacing that line with this one solved the issue for me:
input = fileinput.FileInput(input_file, openhook=fileinput.hook_encoded(encoding='utf-8'))
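One caveat: fileinput.hook_encoded only opens plain files, so this replacement gives up the .gz/.bz2 handling that hook_compressed provides. Below is a minimal sketch of a hook that keeps both, assuming the input is always UTF-8. The name utf8_openhook is hypothetical; it is not part of WikiExtractor or of the fileinput module.

```python
import bz2
import fileinput
import gzip
import os


def utf8_openhook(filename, mode):
    """Hypothetical helper: open plain, .gz, or .bz2 input as UTF-8 text.

    fileinput calls the hook with (filename, mode); mode is ignored here
    because the file is always opened for text reading.
    """
    ext = os.path.splitext(filename)[1]
    if ext == ".gz":
        return gzip.open(filename, "rt", encoding="utf-8")
    if ext == ".bz2":
        return bz2.open(filename, "rt", encoding="utf-8")
    return open(filename, "r", encoding="utf-8")


def read_lines(path):
    """Read every line of the file through fileinput using the hook above."""
    with fileinput.FileInput(path, openhook=utf8_openhook) as stream:
        return list(stream)
```

With this in place, the line above would become input = fileinput.FileInput(input_file, openhook=utf8_openhook). I have only tried the hook_encoded variant against the actual dump, so treat this as untested with WikiExtractor itself.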
If the tool wasn't intended to be used with an already decompressed .xml dump, you can close this issue. Thank you.
Hicham
Thank you, that fixed the issue for me too.