
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs

[Open] yuenherny opened this issue 2 years ago • 6 comments

Describe the bug
The library was unable to decode a byte into a character while iterating over docs.

Affected dataset(s)

  • msmarco-passage/dev/small

To Reproduce
Steps to reproduce the behavior:

  1. Make sure collectionandqueries.tar.gz has already been downloaded into the respective dataset folder inside the ~/.ir_datasets folder
  2. Run:
import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for doc in train.docs_iter():
    doc
  3. Wait for it to run, and you will see an error:
[INFO] [starting] fixing encoding
[INFO] [finished] fixing encoding: [07:07] [3.06GB] [7.16MB/s]
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
d:\Repos\XpressAI\vecto-reranking\1 - Dataset Exploration.ipynb Cell 6 in <cell line: 1>()
----> [1](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=0) for doc in train.docs_iter():
      [2](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=1)     doc

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\__init__.py:147, in DocstoreSplitter.__next__(self)
    146 def __next__(self):
--> 147     return next(self.it)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:92, in TsvIter.__next__(self)
     91 def __next__(self):
---> 92     line = next(self.line_iter)
     93     cols = line.rstrip('\n').split('\t')
     94     num_cols = len(self.cls._fields)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:30, in FileLineIter.__next__(self)
     28         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
     29 while self.pos < self.start:
---> 30     line = self.stream.readline()
     31     if line != '\n':
     32         self.pos += 1

File ~\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined>

Expected behavior
Decoding completes without error.

Additional context
(Screenshot of the error attached.)

yuenherny commented Sep 02 '22 21:09

The symbol — was all over the place in collections.tsv. Maybe this is what causing the error? image

yuenherny commented Sep 02 '22 22:09

Referring to this SO thread, maybe this is the solution?

In Windows, the default encoding is cp1252, but that readme file is most likely encoded in UTF8.

The error message tells you that cp1252 codec is unable to decode the character with the byte 0x9D. When I browsed through the readme file, I found this character: ” (also known as: "RIGHT DOUBLE QUOTATION MARK"), which has the bytes 0xE2 0x80 0x9D, which includes the problematic byte.

From:

with open('README.txt') as file:
    long_description = file.read()

Change into:

with open('README.txt', encoding="utf8") as file:
    long_description = file.read()

This will open the file with the proper encoding.
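A minimal sketch (standard library only, my own check rather than part of the SO answer) that reproduces the quoted diagnosis: the final byte of the UTF-8 encoding of ” is 0x9D, which cp1252 cannot map:

# U+201D RIGHT DOUBLE QUOTATION MARK encodes to three bytes in UTF-8;
# the last one, 0x9d, has no entry in the cp1252 decoding table.
data = '\u201d'.encode('utf-8')
print(data)  # b'\xe2\x80\x9d'

try:
    data.decode('cp1252')
except UnicodeDecodeError as err:
    print(err)  # 'charmap' codec can't decode byte 0x9d in position 2: ...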

When checking Line 30 of ir_datasets\formats\tsv.py, I found this line:

line = self.stream.readline()

and self.stream is an io.TextIOWrapper instance.

Referring to the io docs here:

The default encoding of TextIOWrapper and open() is locale-specific (locale.getpreferredencoding(False)).

However, many developers forget to specify the encoding when opening text files encoded in UTF-8 (e.g. JSON, TOML, Markdown, etc…) since most Unix platforms use UTF-8 locale by default. This causes bugs because the locale encoding is not UTF-8 for most Windows users. ... Accordingly, it is highly recommended that you specify the encoding explicitly when opening text files. If you want to use UTF-8, pass encoding="utf-8". To use the current locale encoding, encoding="locale" is supported in Python 3.10.
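To see which default your platform actually uses, here is a one-liner (my own check, not part of the docs excerpt):

import locale

# Prints 'UTF-8' on most Unix systems; on many Windows installs it prints
# 'cp1252', which is why the traceback above ends in cp1252.py.
print(locale.getpreferredencoding(False))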

yuenherny commented Sep 02 '22 22:09

I modified Lines 26 and 28 of ir_datasets\formats\tsv.py in my venv, adding encoding="utf-8" as an argument to io.TextIOWrapper:

def __next__(self):
    ...
    if self.stream is None:
        if isinstance(self.dlc, list):
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()), encoding="utf-8")
        else:
            self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()), encoding="utf-8")
    while self.pos < self.start:
        line = self.stream.readline()
        if line != '\n':
            self.pos += 1
    ...

and it runs without raising an error now =) (I am using Windows 10.) I would be pleased to open a PR for this, if the collaborators don't mind.
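A possible alternative that avoids patching the installed package (untested here, just an assumption on my part): Python's UTF-8 mode (PEP 540) makes UTF-8 the default text encoding regardless of the Windows locale.

# Enable UTF-8 mode before launching Python, e.g. in Windows cmd:
#     set PYTHONUTF8=1
# or via the interpreter flag:
#     python -X utf8 your_script.py
import sys
print(sys.flags.utf8_mode)  # 1 when UTF-8 mode is active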

yuenherny commented Sep 02 '22 22:09

Thanks! I suspect it's this issue: https://github.com/allenai/ir_datasets/issues/151

There's a branch that fixes it, but for some reason, it hasn't been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes

I'll look into merging in the changes that have been made since the branch was created and pulling it into the main branch.

seanmacavaney commented Sep 03 '22 08:09

It also looks like the FixEncoding module was bypassed, which is why you're getting all the characters like â€”. (FixEncoding replaces them with their correct Unicode versions.)
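(As an illustration only, not the library's actual FixEncoding code: this kind of mojibake is reversible by re-encoding with the codec that produced it and decoding with the right one.)

# Not ir_datasets' implementation, just the underlying idea: undo
# cp1252-misdecoded UTF-8 text by round-tripping the bytes.
garbled = 'â€”'
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired)  # '—' (em dash)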

As with #209, I recommend just letting ir_datasets do its thing automatically. Or, if you already have a file and don't want to wait for the downloads, follow the instructions provided by the system.

seanmacavaney commented Sep 03 '22 08:09

Just to chime in, we've seen this same issue crop up with the irds:nfcorpus/dev dataset too. @seanmacavaney is there any update on getting the encoding-fixes branch merged? I only ask because I assigned my class a homework assignment involving this dataset, and students who use Windows are now reporting that they cannot load it without errors.

davidjurgens commented Sep 25 '22 03:09