wikiextractor: UnicodeDecodeError when extracting from a (decompressed) .xml dump

Open • helboukkouri opened this issue 5 years ago • 1 comment

Hi,

I am getting a UnicodeDecodeError when I try to extract from a decompressed .xml dump. For the record, this is how I am running WikiExtractor:

WikiExtractor.py wikicorpus_en.xml -b 100M --processes 50 -o /path/to/extraction/folder/

I suspect this line is the cause of the problem: for a file that is not compressed, fileinput.hook_compressed seems to open it with open(filename, mode) rather than open(filename, mode, encoding='utf-8'), so the file is decoded with the platform's default encoding instead of UTF-8, which triggers the decoding error.

Replacing that line with this one solved the issue for me:

input = fileinput.FileInput(input_file, openhook=fileinput.hook_encoded(encoding='utf-8'))
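
For anyone who wants to check the behaviour outside WikiExtractor, here is a minimal, self-contained sketch. The helper `read_lines` and the two-line sample file are hypothetical and not part of the tool; the "ascii" case only stands in for a non-UTF-8 default encoding, which is what I assume is happening on my machine.

```python
import fileinput
import os
import tempfile


def read_lines(path, encoding):
    """Read every line of a plain-text file through fileinput with an explicit encoding."""
    hook = fileinput.hook_encoded(encoding)
    with fileinput.FileInput(path, openhook=hook) as stream:
        return [line.rstrip("\n") for line in stream]


# Hypothetical stand-in for a decompressed dump containing non-ASCII text.
sample = "<title>Zürich</title>\n<title>東京</title>\n"
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    tmp_path = tmp.name

try:
    # Decoding the UTF-8 bytes as ASCII reproduces the same kind of failure.
    try:
        read_lines(tmp_path, "ascii")
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Forcing UTF-8 decodes the same bytes cleanly.
    utf8_lines = read_lines(tmp_path, "utf-8")
finally:
    os.remove(tmp_path)

print(ascii_failed, utf8_lines)
```

One caveat: as far as I can tell, fileinput.hook_encoded opens the file with the plain open() call, so with this replacement a .bz2 or .gz dump would presumably no longer be decompressed on the fly.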

If the tool wasn't intended to be used with an already decompressed .xml dump, you can close this issue. Thank you.

Hicham

Jan 08 '20 17:01 helboukkouri

Thank you, that fixed the issue for me too.

Jan 12 '20 11:01 w-henderson
