Loading codeparrot/apps raises UnicodeDecodeError in datasets-2.18.0
Describe the bug
This happens with datasets-2.18.0; downgrading to 2.14.6 fixes it temporarily.
Traceback (most recent call last):
File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
builder_instance = load_dataset_builder(
File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 2228, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 1879, in dataset_module_factory
raise e1 from None
File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 1831, in dataset_module_factory
can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
File "/home/xxx/miniconda3/envs/py310/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
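For what it's worth, byte 0x8b at position 1 is consistent with the two-byte gzip magic number (0x1f 0x8b), which suggests f.read() received a gzip-compressed response body that was never decompressed, rather than plain text. A minimal sketch of that failure mode (an illustration of the symptom, not a confirmed diagnosis):

```python
import gzip

# A gzip-compressed payload always starts with the magic bytes 0x1f 0x8b.
payload = gzip.compress(b"DEFAULT_CONFIG_NAME = None")
assert payload[:2] == b"\x1f\x8b"

# Decoding the raw bytes as UTF-8 fails on the second byte (0x8b),
# matching the traceback above: 0x1f is valid ASCII, 0x8b is not a
# valid UTF-8 start byte.
try:
    payload.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

# Decompressing first recovers the text.
print(gzip.decompress(payload).decode("utf-8"))
```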
Steps to reproduce the bug
- Use Python 3.10 or 3.11
- Install datasets-2.18.0
- Test with:
from datasets import load_dataset
dataset = load_dataset("codeparrot/apps")
Expected behavior
The dataset should download and load without this error.
Environment info
Ubuntu, Python3.10/3.11
The same error occurs with mteb datasets.
Unfortunately, I'm unable to reproduce this error locally or on Colab.
Here is the requirements.txt from a clean virtual environment (managed by conda) where I only installed datasets via pip install datasets.
The pip list:
aiohttp==3.9.3
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.18.0
dill==0.3.8
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.2.0
huggingface-hub==0.22.2
idna==3.6
multidict==6.0.5
multiprocess==0.70.16
numpy==1.26.4
packaging==24.0
pandas==2.2.1
pyarrow==15.0.2
pyarrow-hotfix==0.6
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
requests==2.31.0
six==1.16.0
tqdm==4.66.2
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4
And the error can be reproduced.
Downgrading to datasets==2.14.6 changes some packages' versions:
Successfully installed datasets-2.14.6 dill-0.3.7 fsspec-2023.10.0 multiprocess-0.70.15
and the dataset can be downloaded and loaded.
Then I upgraded to 2.18.0 again; now the dataset can be loaded, but with this warning:
Using the latest cached version of the module from /home/xxx/.cache/huggingface/modules/datasets_modules/datasets/codeparrot--apps/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5 (last modified on Sun Apr 7 09:06:43 2024) since it couldn't be found locally at codeparrot/apps, or remotely on the Hugging Face Hub.
So the latest version misbehaves when requesting the dataset info.
If you cannot reproduce this, I may have omitted a relevant detail: I use HF_ENDPOINT=https://hf-mirror.com (without it I cannot reach Hugging Face resources), and the error occurs when requesting the dataset's info card.
The error may be caused by this environment variable.
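If the mirror really is serving gzip-compressed bytes that the client does not decompress, one possible defensive fix on the reading side would be to sniff the gzip magic bytes before decoding. This is only a sketch of the idea, not the actual datasets code; read_text_maybe_gzip is a hypothetical helper:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def read_text_maybe_gzip(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes as text, transparently decompressing gzip payloads.

    Hypothetical helper: a mirror may return a compressed body that the
    HTTP client did not decompress, so we check the magic bytes first.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Plain text passes through unchanged...
assert read_text_maybe_gzip(b"DEFAULT_CONFIG_NAME") == "DEFAULT_CONFIG_NAME"
# ...while a gzip-compressed body is decompressed before decoding,
# avoiding the UnicodeDecodeError from the traceback above.
assert read_text_maybe_gzip(gzip.compress(b"DEFAULT_CONFIG_NAME")) == "DEFAULT_CONFIG_NAME"
```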
I'll open an issue in the author's repo now.
This is useful; my error is now resolved as well!