UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs
**Describe the bug**
The library was unable to decode a byte into a character.
**Affected dataset(s)**
- `msmarco-passage/dev/small`
**To Reproduce**
Steps to reproduce the behavior:
- Make sure `collectionandqueries.tar.gz` has already been downloaded into the respective dataset folder under `~/.ir_datasets`
- Run:
  ```python
  import ir_datasets

  train = ir_datasets.load('msmarco-passage/dev/small')
  for doc in train.docs_iter():
      doc
  ```
- Wait for it to run, and you will see an error:
```
[INFO] [starting] fixing encoding
[INFO] [finished] fixing encoding: [07:07] [3.06GB] [7.16MB/s]
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
d:\Repos\XpressAI\vecto-reranking\1 - Dataset Exploration.ipynb Cell 6 in <cell line: 1>()
----> 1 for doc in train.docs_iter():
      2     doc

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\__init__.py:147, in DocstoreSplitter.__next__(self)
    146 def __next__(self):
--> 147     return next(self.it)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:92, in TsvIter.__next__(self)
     91 def __next__(self):
---> 92     line = next(self.line_iter)
     93 cols = line.rstrip('\n').split('\t')
     94 num_cols = len(self.cls._fields)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:30, in FileLineIter.__next__(self)
     28     self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
     29 while self.pos < self.start:
---> 30     line = self.stream.readline()
     31     if line != '\n':
     32         self.pos += 1

File ~\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined>
```
**Expected behavior**
Decoding completes without error.
**Additional context**
Screenshot: [screenshot of the file contents omitted]
The symbol `—` was all over the place in `collections.tsv`. Maybe this is what's causing the error?
Referring to this SO thread, maybe this is the solution?

> In Windows, the default encoding is cp1252, but that readme file is most likely encoded in UTF8. The error message tells you that the cp1252 codec is unable to decode the character with the byte 0x9D. When I browsed through the readme file, I found this character: ” (also known as "RIGHT DOUBLE QUOTATION MARK"), which has the bytes 0xE2 0x80 0x9D, which includes the problematic byte.
>
> From:
>
> ```python
> with open('README.txt') as file:
>     long_description = file.read()
> ```
>
> Change into:
>
> ```python
> with open('README.txt', encoding="utf8") as file:
>     long_description = file.read()
> ```
>
> This will open the file with the proper encoding.
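To see the mismatch in isolation, here is a minimal sketch in plain Python (nothing ir_datasets-specific; the example strings are my own) of why this particular byte crashes under cp1252 while a nearby character merely turns into mojibake:

```python
# 'RIGHT DOUBLE QUOTATION MARK' (U+201D) encodes to the UTF-8 bytes
# 0xE2 0x80 0x9D; the last byte has no mapping in cp1252, so decoding fails.
data = '”'.encode('utf-8')
print(data)                  # b'\xe2\x80\x9d'
print(data.decode('utf-8'))  # '”' - decodes correctly
try:
    data.decode('cp1252')
except UnicodeDecodeError as e:
    print(e)  # 'charmap' codec can't decode byte 0x9d in position 2: ...

# An em dash (U+2014, bytes 0xE2 0x80 0x94) is sneakier: all three of its
# bytes *do* map in cp1252, so it silently becomes the three-character
# mojibake 'â' + '€' + '”' instead of raising an error.
print('—'.encode('utf-8').decode('cp1252'))
```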
When checking Line 30 of `ir_datasets\formats\tsv.py`, I found this line: `line = self.stream.readline()`, and `self.stream` is an `io.TextIOWrapper` instance.
Referring to the `io` docs here:

> The default encoding of TextIOWrapper and open() is locale-specific (locale.getpreferredencoding(False)).
>
> However, many developers forget to specify the encoding when opening text files encoded in UTF-8 (e.g. JSON, TOML, Markdown, etc…) since most Unix platforms use UTF-8 locale by default. This causes bugs because the locale encoding is not UTF-8 for most Windows users. ... Accordingly, it is highly recommended that you specify the encoding explicitly when opening text files. If you want to use UTF-8, pass encoding="utf-8". To use the current locale encoding, encoding="locale" is supported in Python 3.10.
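A quick way to confirm this on any machine (a small self-contained sketch, not part of ir_datasets):

```python
import io
import locale

# Typically prints 'cp1252' on Windows and 'UTF-8' on most Unix systems.
print(locale.getpreferredencoding(False))

utf8_bytes = 'right quote: ”\n'.encode('utf-8')

# Without an explicit encoding, io.TextIOWrapper falls back to the locale
# encoding, so under cp1252 this would raise the same UnicodeDecodeError:
#   io.TextIOWrapper(io.BytesIO(utf8_bytes)).readline()

# With the encoding pinned, it decodes correctly everywhere:
print(io.TextIOWrapper(io.BytesIO(utf8_bytes), encoding='utf-8').readline())
```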
I modified Lines 26 and 28 of `ir_datasets\formats\tsv.py` in my venv, adding `encoding="utf-8"` as an argument to `io.TextIOWrapper`:

```python
def __next__(self):
    ...
    if self.stream is None:
        if isinstance(self.dlc, list):
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()), encoding="utf-8")
        else:
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()), encoding="utf-8")
    while self.pos < self.start:
        line = self.stream.readline()
        if line != '\n':
            self.pos += 1
    ...
```
and it now runs without raising an error =) (I am using Windows 10.) I would be pleased to open a PR for this, if the collaborators don't mind.
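For anyone who would rather not patch library code inside their venv: as far as I know, Python 3.7+ also has a UTF-8 mode (enabled by running `python -X utf8` or setting the `PYTHONUTF8=1` environment variable before the interpreter starts) that makes UTF-8 the default for `open()` and `io.TextIOWrapper` regardless of the Windows locale. I have not tested it against this dataset, but it should sidestep the same failure. You can check whether it is active like so:

```python
import sys

# sys.flags.utf8_mode is 1 when Python was started with -X utf8 or with
# PYTHONUTF8=1 in the environment; in that mode the default encoding for
# open() and io.TextIOWrapper is UTF-8 regardless of the system locale.
print(sys.flags.utf8_mode)
```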
Thanks! I suspect it's this issue: https://github.com/allenai/ir_datasets/issues/151
There's a branch that fixes it, but for some reason, it hasn't been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes
I'll look into merging the changes that have been made since that branch was created and pulling it into the main branch.
It also looks like the `FixEncoding` module was bypassed, which is why you're getting all the characters like `—`. (`FixEncoding` replaces them with their correct unicode versions.)
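For context (this is just the general idea behind that kind of repair, not necessarily how `FixEncoding` is implemented internally): UTF-8 bytes mis-decoded as cp1252 can usually be fixed by reversing the mistake, i.e. re-encoding with the wrong codec and decoding with the right one:

```python
# Classic mojibake repair: the text was UTF-8 bytes mistakenly decoded as
# cp1252, so encoding it back to cp1252 and decoding as UTF-8 recovers
# the intended character.
broken = '—'.encode('utf-8').decode('cp1252')   # three-character mojibake
fixed = broken.encode('cp1252').decode('utf-8')
print(fixed)  # '—'
```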
As with #209, I recommend just letting ir_datasets do its thing automatically. Or, if you already have a file and don't want to wait for the downloads, follow the instructions provided by the system.
Just to chime in, we've seen this same issue crop up with the `irds:nfcorpus/dev` dataset too. @seanmacavaney is there any update on getting the encoding-fix branch merged? I only ask because I assigned my class a homework assignment involving this dataset, and students who use Windows are now reporting that they can't load it without errors.