hf-mirror-site
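After setting the mirror through an environment variable, some datasets fail to download with the latest version of datasets
Clean environment, Python=3.11, with only datasets (==2.18.0) installed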
from datasets import load_dataset
dataset = load_dataset("codeparrot/apps")
This fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
builder_instance = load_dataset_builder(
^^^^^^^^^^^^^^^^^^^^^
File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 2228, in load_dataset_builder
dataset_module = dataset_module_factory(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 1879, in dataset_module_factory
raise e1 from None
File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 1831, in dataset_module_factory
can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
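The official developers could not reproduce this. Because the error occurs only while requesting dataset information from the server, before any data files are read, my guess is that the problem lies with the mirror (see the feedback in https://github.com/huggingface/datasets/issues/6760).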
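One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: gzip streams start with the bytes 0x1f 0x8b, so a 0x8b at position 1 would be consistent with the loading script arriving gzip-compressed and being read as plain UTF-8 text. The sketch below reproduces that failure on locally generated bytes and shows a check for it. The helper names `looks_gzipped` and `decode_script` are hypothetical and are not part of datasets.

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def decode_script(data: bytes) -> str:
    """Decode a payload as UTF-8, decompressing it first if it is gzipped."""
    if looks_gzipped(data):
        data = gzip.decompress(data)
    return data.decode("utf-8")


# Simulated payload (assumption): a loading script that was served compressed.
sample = gzip.compress(b'DEFAULT_CONFIG_NAME = "all"\n')

try:
    sample.decode("utf-8")  # what a plain text read would attempt
except UnicodeDecodeError as err:
    print("plain decode failed:", err)

print(decode_script(sample))
```

If the file that datasets opened at that point in load.py also begins with 0x1f 0x8b, that would point toward the mirror's response encoding rather than the dataset itself.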