# Issue with offline mode

## Describe the bug
I can't retrieve a cached dataset with offline mode enabled
## Steps to reproduce the bug

To reproduce the issue, first run a script that caches the dataset:

```python
import os
os.environ["HF_DATASETS_OFFLINE"] = "0"
import datasets
datasets.logging.set_verbosity_info()
ds_name = "SaulLu/toy_struc_dataset"
ds = datasets.load_dataset(ds_name)
print(ds)
```

Then, you can try to reload it in offline mode:

```python
import os
os.environ["HF_DATASETS_OFFLINE"] = "1"
import datasets
datasets.logging.set_verbosity_info()
ds_name = "SaulLu/toy_struc_dataset"
ds = datasets.load_dataset(ds_name)
print(ds)
```

## Expected results

I would have expected the second snippet not to raise any error.

## Actual results

The second snippet raises:

```
Traceback (most recent call last):
File "/home/lucile_huggingface_co/sandbox/evaluate/test_cache_datasets.py", line 8, in <module>
ds = datasets.load_dataset(ds_name)
File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1723, in load_dataset
builder_instance = load_dataset_builder(
File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1500, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1241, in dataset_module_factory
raise ConnectionError(f"Couln't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couln't reach the Hugging Face Hub for dataset 'SaulLu/toy_struc_dataset': Offline mode is enabled.
```

## Environment info

- `datasets` version: 2.4.0
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.17
- Python version: 3.8.13
- PyArrow version: 8.0.0
- Pandas version: 1.4.3
Maybe I'm misunderstanding something in the use of offline mode (see doc). Is that the case?
Hi @SaulLu, thanks for reporting.
I think offline mode is not supported for datasets containing only data files (without any loading script). I'm having a look into this...
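In the meantime, a possible workaround (just a sketch, not an official fix) is to bypass Hub resolution entirely with `save_to_disk`/`load_from_disk`, which only read and write local files:

```python
import datasets

# While online: download once and serialize the dataset to a local directory
# (the path is arbitrary, chosen for this sketch).
ds = datasets.load_dataset("SaulLu/toy_struc_dataset")
ds.save_to_disk("./toy_struc_dataset")

# Later, even fully offline: reload from the local copy.
# load_from_disk reads only local files, so HF_DATASETS_OFFLINE doesn't matter here.
ds = datasets.load_from_disk("./toy_struc_dataset")
print(ds)
```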
Thanks for your feedback!

To give you a little more info: if you don't set the offline-mode flag, the script loads from the cache. I first noticed this behavior with the `evaluate` library, and while trying to understand the download flow I realized that I had a similar error with `datasets`.
This is an issue we have to fix.
This is related to https://github.com/huggingface/datasets/issues/3547
Still not fixed?
#5331 will be helpful to fix this, as it updates the cache directory template to be aligned with the other datasets.
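For context, the layout change is roughly the following (illustrative paths only; the exact builder names, versions, and hashes here are assumptions, not taken from this thread):

```
# Before #5331, a no-script Hub dataset was cached under its packaged builder name:
~/.cache/huggingface/datasets/saullu___parquet/SaulLu--toy_struc_dataset-<hash>/...
# After #5331, it is cached under the dataset name, like script-based datasets:
~/.cache/huggingface/datasets/saullu___toy_struc_dataset/default/0.0.0/<hash>/...
```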
Any updates?
I'm facing the same problem
This issue has been fixed in `datasets` 2.16 by https://github.com/huggingface/datasets/pull/6493. The cache is now working properly :)

You just have to update `datasets`:

```
pip install -U datasets
```
I'm on version 2.17.0, and this exact problem is still persisting.
Can you share some code to reproduce your issue?

Also make sure your cache was populated with a recent version of `datasets`. Datasets cached with old versions may not be reloadable in offline mode, though we did our best to keep as much backward compatibility as possible.
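If you suspect the cache was written by an old version, one option (a sketch using the `download_mode` argument of `load_dataset`) is to rebuild it once while online:

```python
import datasets

# Run once while online: ignore the existing cache and rewrite it
# with the currently installed version of datasets.
ds = datasets.load_dataset(
    "SaulLu/toy_struc_dataset",  # example dataset from the original report
    download_mode="force_redownload",
)
print(ds)
```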
I'm not sure if this is related, @lhoestq, but I am experiencing a similar issue when using offline mode:
```
$ python -c "from datasets import load_dataset; load_dataset('openai_humaneval', split='test')"
$ HF_DATASETS_OFFLINE=1 python -c "from datasets import load_dataset; load_dataset('openai_humaneval', split='test')"
Using the latest cached version of the dataset since openai_humaneval couldn't be found on the Hugging Face Hub (offline mode is enabled).
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
builder_instance = load_dataset_builder(
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 122, in __init__
config_name, version, hash = _find_hash_in_cache(
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 48, in _find_hash_in_cache
raise ValueError(
ValueError: Couldn't find cache for openai_humaneval for config 'default'
Available configs in the cache: ['openai_humaneval']
```
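Until a fix lands, a possible stopgap (untested; inferred from the error message, which shows the cached config is named `openai_humaneval` rather than `default`) is to request that config name explicitly:

```python
from datasets import load_dataset

# Untested sketch: ask for the config name that the cache actually contains.
ds = load_dataset("openai_humaneval", name="openai_humaneval", split="test")
```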
Thanks for reporting @BramVanroy, I managed to reproduce and I opened a fix here: https://github.com/huggingface/datasets/pull/6741
Awesome, thanks for the quick fix @lhoestq! Looking forward to updating my dependency version list.
> Thanks for reporting @BramVanroy, I managed to reproduce and I opened a fix here: #6741
Thanks a lot! I have faced the same problem. Can I apply your fix directly on top of my existing version? I noticed that this fix has not been merged yet. Will it affect other functionality?
I just merged the fix. You can install `datasets` from source or wait for the patch release, which will be out in the coming days.
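For example, a standard way to install the development version from source (standard pip syntax, not quoted from the thread):

```
pip install -U git+https://github.com/huggingface/datasets
```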