Got stuck when trying to load a dataset

Open yirenpingsheng opened this issue 1 year ago • 6 comments

Describe the bug

Hello, everyone. I ran into a problem when trying to load a data file with the load_dataset method on a Debian 10 system. The data file is not very large: only 1.63 MB, with 600 records. Here is my code:

from datasets import load_dataset

dataset = load_dataset('json', data_files='mypath/oaast_rm_zh.json')

I waited for 20 minutes and there was still no response. I cannot cancel the command with Ctrl+C; I have to use Ctrl+Z to stop it. I also tried a txt file, and it also hung for a long time with no response.

I can load the same file successfully on my laptop (Windows 10, Python 3.8.5, datasets==2.14.5). It also works on another computer (Ubuntu 20.04.5 LTS, Python 3.10.13, datasets 2.14.7). It only takes 1-2 minutes.

Could you give me some suggestions? Thank you.
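Because the call hangs and ignores Ctrl+C, one way to see where it is stuck is Python's standard-library faulthandler module, which can print every thread's stack after a timeout. The following is only a minimal sketch: the helper names arm_hang_dump and disarm_hang_dump and the 60-second default are illustrative assumptions, not part of datasets.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s=60, file=sys.stderr):
    """Dump all thread stacks to `file` if still running after timeout_s seconds.

    Hypothetical helper name; it only wraps the standard-library call.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file, exit=False)


def disarm_hang_dump():
    """Cancel the pending dump once the slow call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_hang_dump() just before load_dataset(...) and disarm_hang_dump() right after it should show, in the dumped stack, whether the process is waiting inside a file lock or somewhere else.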

Steps to reproduce the bug

from datasets import load_dataset

dataset = load_dataset('json', data_files='mypath/oaast_rm_zh.json')

Expected behavior

I hope it can load the file successfully.

Environment info

OS: Debian GNU/Linux 10
Python: Python 3.10.13
Pip list (Package Version):

accelerate 0.25.0 addict 2.4.0 aiofiles 23.2.1 aiohttp 3.9.1 aiosignal 1.3.1 aliyun-python-sdk-core 2.14.0 aliyun-python-sdk-kms 2.16.2 altair 5.2.0 annotated-types 0.6.0 anyio 3.7.1 async-timeout 4.0.3 attrs 23.1.0 certifi 2023.11.17 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 contourpy 1.2.0 crcmod 1.7 cryptography 41.0.7 cycler 0.12.1 datasets 2.14.7 dill 0.3.7 docstring-parser 0.15 einops 0.7.0 exceptiongroup 1.2.0 fastapi 0.105.0 ffmpy 0.3.1 filelock 3.13.1 fonttools 4.46.0 frozenlist 1.4.1 fsspec 2023.10.0 gast 0.5.4 gradio 3.50.2 gradio_client 0.6.1 h11 0.14.0 httpcore 1.0.2 httpx 0.25.2 huggingface-hub 0.19.4 idna 3.6 importlib-metadata 7.0.0 importlib-resources 6.1.1 jieba 0.42.1 Jinja2 3.1.2 jmespath 0.10.0 joblib 1.3.2 jsonschema 4.20.0 jsonschema-specifications 2023.11.2 kiwisolver 1.4.5 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib 3.8.2 mdurl 0.1.2 modelscope 1.10.0 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.15 networkx 3.2.1 nltk 3.8.1 numpy 1.26.2 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.3.101 nvidia-nvtx-cu12 12.1.105 orjson 3.9.10 oss2 2.18.3 packaging 23.2 pandas 2.1.4 peft 0.7.1 Pillow 10.1.0 pip 23.3.1 platformdirs 4.1.0 protobuf 4.25.1 psutil 5.9.6 pyarrow 14.0.1 pyarrow-hotfix 0.6 pycparser 2.21 pycryptodome 3.19.0 pydantic 2.5.2 pydantic_core 2.14.5 pydub 0.25.1 Pygments 2.17.2 pyparsing 3.1.1 python-dateutil 2.8.2 python-multipart 0.0.6 pytz 2023.3.post1 PyYAML 6.0.1 referencing 0.32.0 regex 2023.10.3 requests 2.31.0 rich 13.7.0 rouge-chinese 1.0.3 rpds-py 0.13.2 safetensors 0.4.1 scipy 1.11.4 semantic-version 2.10.0 sentencepiece 0.1.99 setuptools 68.2.2 shtab 1.6.5 simplejson 3.19.2 six 1.16.0 sniffio 1.3.0 sortedcontainers 2.4.0 sse-starlette 1.8.2 
starlette 0.27.0 sympy 1.12 tiktoken 0.5.2 tokenizers 0.15.0 tomli 2.0.1 toolz 0.12.0 torch 2.1.2 tqdm 4.66.1 transformers 4.36.1 triton 2.1.0 trl 0.7.4 typing_extensions 4.9.0 tyro 0.6.0 tzdata 2023.3 urllib3 2.1.0 uvicorn 0.24.0.post1 websockets 11.0.3 wheel 0.41.2 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zipp 3.17.0

yirenpingsheng avatar Dec 16 '23 11:12 yirenpingsheng

I ran into the same problem on a server cluster (managed by Slurm), which couldn't load any of the huggingface datasets or models, although everything worked on my laptop. I suspect a problem related to the system configuration, but I have no idea what it is. My problems are consistent with issue #2618. All the huggingface-related libraries I use are the latest versions.

hutaiHang avatar Jan 06 '24 15:01 hutaiHang

Have you solved this issue yet? I hit the same problem on a server, but everything works on my laptop. I think the filelock library may be incompatible with the file system.

P4rsee avatar Jan 18 '24 10:01 P4rsee

I am having the same issue on a computing cluster, but this works on my laptop as well. I instead have this error:

File "/home/.conda/envs/py10/lib/python3.10/site-packages/filelock/_unix.py", line 43, in _acquire
fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 5] Input/output error

The load_dataset command does not work on the server for either local or hosted Hugging Face datasets, and I have tried several files.
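The failing call in that traceback is fcntl.flock, so a quick check is to try the same lock on a throwaway file in the directory in question. This is a minimal sketch assuming a Unix system; flock_works is a hypothetical helper name, and the directory to test (for example the cache directory) is up to the reader.

```python
import fcntl
import os
import tempfile


def flock_works(directory):
    """Return True if a non-blocking exclusive flock succeeds in `directory`.

    Hypothetical diagnostic helper: it creates a temporary file, tries the
    same flock flags shown in the traceback, then cleans up.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # For example: [Errno 5] Input/output error on some shared file systems
        return False
    finally:
        os.close(fd)
        os.remove(path)
```

If this returns False for the cache directory but True for a directory on local disk, the file system under the cache is a likely cause.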

jferina24 avatar Jan 19 '24 16:01 jferina24

Same here. Is there any solution?

chujiezheng avatar Feb 08 '24 21:02 chujiezheng

In my case, .cache was in a shared folder. Moving it into the user's home folder fixed the problem. See #2618 for more details.

karray avatar Apr 05 '24 09:04 karray

Can you be more specific? Thanks.

har77774 avatar May 10 '24 05:05 har77774
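
As a rough illustration of the cache suggestion above, the cache location can be redirected with the HF_HOME and HF_DATASETS_CACHE environment variables before datasets is imported. This is only a sketch: use_local_hf_cache is a hypothetical helper, and the default path under the home folder is an assumption based on karray's comment (on some clusters the home folder is itself shared, in which case a node-local directory would need to be passed instead).

```python
import os
from pathlib import Path


def use_local_hf_cache(cache_root=None):
    """Point the Hugging Face caches at `cache_root` and return that path.

    Hypothetical helper. Assumed default: ~/.cache/huggingface.
    Call it before importing `datasets`, since the cache paths are
    read from the environment at import time.
    """
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "huggingface"
    datasets_cache = root / "datasets"
    datasets_cache.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(root)
    os.environ["HF_DATASETS_CACHE"] = str(datasets_cache)
    return root
```

After calling use_local_hf_cache(), importing datasets and running load_dataset('json', data_files='mypath/oaast_rm_zh.json') should place the cache and its lock files under the new directory. Passing cache_dir=... to load_dataset is another way to choose the location for a single call.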