Fix issue with case sensitivity when loading dataset from local cache

Open • Sumsky21 opened this issue 3 months ago • 1 comment

When a dataset with uppercase letters in its name is first loaded using load_dataset(), the local cache directory is created with an all-lowercase name.

However, on subsequent loads, the current version tries to locate the cache directory using the dataset's original name, which still contains the uppercase letters. The lookup therefore misses the existing cache. This is confusing in general and, in offline mode, causes an error because the library falls back to the Hub, which it cannot reach.

Steps to reproduce

~$ python            
Python 3.9.19 (main, Mar 21 2024, 17:11:28) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("locuslab/TOFU", "full")
>>> quit()

~$ export HF_DATASETS_OFFLINE=1 
~$ python                      
Python 3.9.19 (main, Mar 21 2024, 17:11:28) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("locuslab/TOFU", "full")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 1871, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'locuslab/TOFU': Offline mode is enabled.
>>> 

I fixed this issue by lowercasing the dataset name (.lower()) when generating cache_dir.
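A minimal sketch of the idea, not the actual patch: the helper name cache_dir_for, the cache root /tmp/hf_cache, and the "___" separator used in place of "/" are all assumptions made for illustration. The point is only that normalizing case before building the path makes the first load and later loads resolve to the same directory.

```python
from pathlib import Path


def cache_dir_for(cache_root: str, dataset_name: str) -> Path:
    """Hypothetical helper: build a cache directory path from a dataset name."""
    # Lowercase the name so "locuslab/TOFU" and "locuslab/tofu" map to one directory.
    # Replacing "/" with "___" is an assumption of this sketch, not a documented rule.
    safe_name = dataset_name.lower().replace("/", "___")
    return Path(cache_root) / safe_name


# Both spellings now resolve to the same cache directory.
first_load = cache_dir_for("/tmp/hf_cache", "locuslab/TOFU")
later_load = cache_dir_for("/tmp/hf_cache", "locuslab/tofu")
assert first_load == later_load
print(first_load.name)  # locuslab___tofu
```

Without the .lower() call, the two paths would differ on a case-sensitive filesystem, which is the mismatch described above.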

Sumsky21, Mar 28 '24 14:03

I also need this feature for "Cnam-LMSSC/vibravox".

EDIT: Upgrading to 2.19.0 fixed my problem thanks to this PR

jhauret, Apr 17 '24 16:04