# Issue with offline mode

## Describe the bug
I can't retrieve a cached dataset with offline mode enabled
## Steps to reproduce the bug

To reproduce the issue, first run a script that caches the dataset:

```python
import os
os.environ["HF_DATASETS_OFFLINE"] = "0"
import datasets
datasets.logging.set_verbosity_info()
ds_name = "SaulLu/toy_struc_dataset"
ds = datasets.load_dataset(ds_name)
print(ds)
```

Then, you can try to reload it in offline mode:

```python
import os
os.environ["HF_DATASETS_OFFLINE"] = "1"
import datasets
datasets.logging.set_verbosity_info()
ds_name = "SaulLu/toy_struc_dataset"
ds = datasets.load_dataset(ds_name)
print(ds)
```

## Expected results

I would have expected the second snippet not to raise any error.

## Actual results

The second snippet raises:

```
Traceback (most recent call last):
File "/home/lucile_huggingface_co/sandbox/evaluate/test_cache_datasets.py", line 8, in <module>
ds = datasets.load_dataset(ds_name)
File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1723, in load_dataset
builder_instance = load_dataset_builder(
File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1500, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1241, in dataset_module_factory
raise ConnectionError(f"Couln't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couln't reach the Hugging Face Hub for dataset 'SaulLu/toy_struc_dataset': Offline mode is enabled.
```

## Environment info

- `datasets` version: 2.4.0
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.17
- Python version: 3.8.13
- PyArrow version: 8.0.0
- Pandas version: 1.4.3
Maybe I'm misunderstanding something in the use of offline mode (see doc). Is that the case?
Hi @SaulLu, thanks for reporting.
I think offline mode is not supported for datasets containing only data files (without any loading script). I'm having a look into this...
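In the meantime, a possible workaround (just a sketch, not an official fix) is to bypass Hub resolution entirely with `save_to_disk`/`load_from_disk`, which only read and write local files:

```python
import datasets

# While online: download once and serialize the dataset to a local directory
# (the path is arbitrary, chosen for this sketch).
ds = datasets.load_dataset("SaulLu/toy_struc_dataset")
ds.save_to_disk("./toy_struc_dataset")

# Later, even fully offline: reload from the local copy.
# load_from_disk reads only local files, so HF_DATASETS_OFFLINE doesn't matter here.
ds = datasets.load_from_disk("./toy_struc_dataset")
print(ds)
```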
Thanks for your feedback!

To give you a little more info: if you don't set the offline-mode flag, the script loads from the cache. I first noticed this behavior with the `evaluate` library, and while trying to understand the download flow I realized that I had a similar error with `datasets`.
This is an issue we have to fix.
This is related to https://github.com/huggingface/datasets/issues/3547
Still not fixed?
#5331 will be helpful to fix this, as it updates the cache directory template to be aligned with the other datasets.
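For context, the layout change is roughly the following (illustrative paths only; the exact builder names, versions, and hashes here are assumptions, not taken from this thread):

```
# Before #5331, a no-script Hub dataset was cached under its packaged builder name:
~/.cache/huggingface/datasets/saullu___parquet/SaulLu--toy_struc_dataset-<hash>/...
# After #5331, it is cached under the dataset name, like script-based datasets:
~/.cache/huggingface/datasets/saullu___toy_struc_dataset/default/0.0.0/<hash>/...
```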
Any updates?
I'm facing the same problem
This issue has been fixed in `datasets` 2.16 by https://github.com/huggingface/datasets/pull/6493. The cache is now working properly :)

You just have to update `datasets`:

```
pip install -U datasets
```
I'm on version 2.17.0, and this exact problem is still persisting.
Can you share some code to reproduce your issue?

Also make sure your cache was populated with a recent version of `datasets`. Datasets cached with old versions may not be reloadable in offline mode, though we did our best to keep as much backward compatibility as possible.
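If you suspect the cache was written by an old version, one option (a sketch using the `download_mode` argument of `load_dataset`) is to rebuild it once while online:

```python
import datasets

# Run once while online: ignore the existing cache and rewrite it
# with the currently installed version of datasets.
ds = datasets.load_dataset(
    "SaulLu/toy_struc_dataset",  # example dataset from the original report
    download_mode="force_redownload",
)
print(ds)
```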
I'm not sure if this is related, @lhoestq, but I am experiencing a similar issue when using offline mode:
```
$ python -c "from datasets import load_dataset; load_dataset('openai_humaneval', split='test')"
$ HF_DATASETS_OFFLINE=1 python -c "from datasets import load_dataset; load_dataset('openai_humaneval', split='test')"
Using the latest cached version of the dataset since openai_humaneval couldn't be found on the Hugging Face Hub (offline mode is enabled).
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
builder_instance = load_dataset_builder(
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 122, in __init__
config_name, version, hash = _find_hash_in_cache(
File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 48, in _find_hash_in_cache
raise ValueError(
ValueError: Couldn't find cache for openai_humaneval for config 'default'
Available configs in the cache: ['openai_humaneval']
```
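Until a fix lands, a possible stopgap (untested; inferred from the error message, which shows the cached config is named `openai_humaneval` rather than `default`) is to request that config name explicitly:

```python
from datasets import load_dataset

# Untested sketch: ask for the config name that the cache actually contains.
ds = load_dataset("openai_humaneval", name="openai_humaneval", split="test")
```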
Thanks for reporting @BramVanroy, I managed to reproduce and I opened a fix here: https://github.com/huggingface/datasets/pull/6741
Awesome, thanks for the quick fix @lhoestq! Looking forward to updating my dependency version list.
> Thanks for reporting @BramVanroy, I managed to reproduce and I opened a fix here: #6741
Thanks a lot! I have faced the same problem. Can I apply your fix directly on top of my existing version? I noticed that this fix has not been merged yet. Will it affect other functionality?
I just merged the fix. You can install `datasets` from source or wait for the patch release, which will be out in the coming days.
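For example, a standard way to install the development version from source (standard pip syntax, not quoted from the thread):

```
pip install -U git+https://github.com/huggingface/datasets
```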