
Some datasets fail to download with the latest datasets release when the mirror is configured through an environment variable

Open yucc-leon opened this issue 10 months ago • 15 comments

Clean environment, Python=3.11, with only datasets (==2.18.0) installed.

from datasets import load_dataset
dataset = load_dataset("codeparrot/apps")

This raises:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 1879, in dataset_module_factory
    raise e1 from None
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 1831, in dataset_module_factory
    can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
                                                                       ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

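The official developers could not reproduce this. Because the error occurs only while requesting dataset information from the server, before any data files are read, I suspect the problem lies with the mirror (see the feedback in https://github.com/huggingface/datasets/issues/6760).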
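One observation that may help narrow this down: a gzip stream always starts with the two bytes 0x1f 0x8b, and the traceback complains about byte 0x8b at position 1. That is consistent with (but does not prove) the server returning a gzip-compressed body that was then read as plain text. The sketch below is only an illustration of that hypothesis, not the actual code path in datasets; `looks_gzipped`, `decode_script`, and the sample script text are hypothetical names made up for the example.

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(raw: bytes) -> bool:
    """Return True if the payload starts with the gzip magic bytes."""
    return raw[:2] == GZIP_MAGIC


def decode_script(raw: bytes) -> str:
    """Decode a payload as UTF-8, decompressing first if it looks gzipped."""
    if looks_gzipped(raw):
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


# Hypothetical stand-in for a dataset loading script served by a mirror.
script = 'DEFAULT_CONFIG_NAME = "all"\n'
compressed = gzip.compress(script.encode("utf-8"))

# Reading the compressed bytes as text fails in the same way as the traceback.
try:
    compressed.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as err:
    error_message = str(err)

print(error_message)
print(decode_script(compressed) == script)
```

If the file cached from the mirror starts with those two bytes, that would point at the mirror's response encoding rather than at the dataset itself.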

yucc-leon · Apr 07 '24 09:04