Load codeparrot/apps raising UnicodeDecodeError in datasets-2.18.0

Open yucc-leon opened this issue 1 year ago • 4 comments

Describe the bug

This happens with datasets-2.18.0; downgrading to 2.14.6 fixes it temporarily.

Traceback (most recent call last):
  File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 1879, in dataset_module_factory
    raise e1 from None
  File "/home/xxx/miniconda3/envs/py310/lib/python3.10/site-packages/datasets/load.py", line 1831, in dataset_module_factory
    can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
  File "/home/xxx/miniconda3/envs/py310/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
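
A hint from the traceback: byte 0x8b in position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so the bytes being decoded look like a gzip-compressed response that was read as plain text. A minimal sketch, just to illustrate that decoding a gzip stream as UTF-8 fails with exactly this message:

import gzip

# Every gzip stream begins with the magic bytes 0x1f 0x8b.
payload = gzip.compress(b"DEFAULT_CONFIG_NAME = None")
try:
    payload.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte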

Steps to reproduce the bug

  1. Use Python 3.10 or 3.11
  2. Install datasets-2.18.0
  3. Run:
from datasets import load_dataset
dataset = load_dataset("codeparrot/apps")

Expected behavior

The dataset should download and load without this error.

Environment info

Ubuntu, Python 3.10/3.11

yucc-leon avatar Mar 28 '24 03:03 yucc-leon

I get the same error with mteb datasets.

4IK1d avatar Mar 29 '24 14:03 4IK1d

Unfortunately, I'm unable to reproduce this error locally or on Colab.

mariosasko avatar Apr 02 '24 13:04 mariosasko

Here is the pip list from a clean virtual environment (managed by conda) where I installed only datasets via pip install datasets:

aiohttp==3.9.3
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.18.0
dill==0.3.8
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.2.0
huggingface-hub==0.22.2
idna==3.6
multidict==6.0.5
multiprocess==0.70.16
numpy==1.26.4
packaging==24.0
pandas==2.2.1
pyarrow==15.0.2
pyarrow-hotfix==0.6
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
requests==2.31.0
six==1.16.0
tqdm==4.66.2
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4

And the error can be reproduced.

Downgrading to datasets==2.14.6 changes some packages' versions:

Successfully installed datasets-2.14.6 dill-0.3.7 fsspec-2023.10.0 multiprocess-0.70.15

and the dataset can be downloaded and loaded.

Then I upgraded to 2.18.0 again; now the dataset loads, but with this line in the output:

Using the latest cached version of the module from /home/xxx/.cache/huggingface/modules/datasets_modules/datasets/codeparrot--apps/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5 (last modified on Sun Apr 7 09:06:43 2024) since it couldn't be found locally at codeparrot/apps, or remotely on the Hugging Face Hub.
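
One way to rule out the cached module masking the failure, a sketch using the documented download_mode parameter of load_dataset rather than anything from this thread, is to force a fresh download:

from datasets import load_dataset

# Bypass the local cache so the loader contacts the endpoint again.
dataset = load_dataset("codeparrot/apps", download_mode="force_redownload")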

So the latest version misbehaves when requesting the dataset info.

But if you cannot reproduce this, I may have omitted a detail: I use HF_ENDPOINT=https://hf-mirror.com (without it I cannot reach Hugging Face resources at all), and the error occurs when requesting the dataset's info card. Maybe the error is caused by this environment variable. I'll open an issue in the mirror author's repo now.
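
For anyone trying to reproduce this, the failing setup is roughly the following sketch (hf-mirror.com is the third-party mirror mentioned above):

import os

# HF_ENDPOINT must be set before importing datasets/huggingface_hub,
# because the endpoint is read once at import time.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from datasets import load_dataset

# With datasets==2.18.0 this raised the UnicodeDecodeError above.
dataset = load_dataset("codeparrot/apps")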

yucc-leon avatar Apr 07 '24 09:04 yucc-leon

This is useful; my error is now resolved too!

xuyuzhuang11 avatar Jun 19 '24 07:06 xuyuzhuang11