datasets
Fix issue with case sensitivity when loading dataset from local cache
When a dataset with uppercase letters in its name is first loaded using load_dataset(), the local cache directory is created with an all-lowercase name.
However, upon subsequent loads, the current version attempts to locate the cache directory using the dataset's original name, which includes uppercase letters. This discrepancy can lead to confusion and, particularly in offline mode, results in errors.
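The mismatch can be illustrated with a small sketch. The two helper functions and the "___" separator below are assumptions made for illustration; they are not the library's actual internals.

```python
# Hypothetical helpers that mimic the behavior described above.
# They are NOT the real datasets code paths.

def cache_dir_name_on_write(dataset_name: str) -> str:
    # Assumption: the first load lowercases the name when creating the directory.
    return dataset_name.replace("/", "___").lower()


def cache_dir_name_on_lookup(dataset_name: str) -> str:
    # Assumption: later loads keep the original casing when searching the cache.
    return dataset_name.replace("/", "___")


written = cache_dir_name_on_write("locuslab/TOFU")
looked_up = cache_dir_name_on_lookup("locuslab/TOFU")

# On a case-sensitive filesystem these two names refer to different directories,
# so the cached copy is not found on the second load.
mismatch = written != looked_up
print(written, looked_up, mismatch)
```

When the network is available, a missed lookup like this would typically be hidden by a fallback to the Hub; in offline mode there is no fallback, which is why the error only surfaces there.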
Reproduce
~$ python
Python 3.9.19 (main, Mar 21 2024, 17:11:28)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("locuslab/TOFU", "full")
>>> quit()
~$ export HF_DATASETS_OFFLINE=1
~$ python
Python 3.9.19 (main, Mar 21 2024, 17:11:28)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("locuslab/TOFU", "full")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "xxxxxx/anaconda3/envs/llm/lib/python3.9/site-packages/datasets/load.py", line 1871, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'locuslab/TOFU': Offline mode is enabled.
>>>
I fixed this issue by lowercasing the dataset name (.lower()) when generating cache_dir.
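A minimal sketch of that normalization, assuming a hypothetical build_cache_dir helper and an illustrative cache root (neither is taken from the actual patch):

```python
import os


def build_cache_dir(cache_root: str, dataset_name: str) -> str:
    """Hypothetical helper: derive one cache path regardless of the name's casing."""
    # Lowercase the name so the path used when writing the cache and the path
    # used when looking it up are always identical.
    normalized = dataset_name.lower().replace("/", "___")  # "___" is an assumed separator
    return os.path.join(cache_root, normalized)


# Both spellings resolve to the same directory.
a = build_cache_dir("/tmp/hf_cache", "locuslab/TOFU")
b = build_cache_dir("/tmp/hf_cache", "locuslab/tofu")
same = a == b
print(a, same)
```

Normalizing in one place, where cache_dir is generated, keeps the write path and the lookup path from drifting apart again.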
I also need this feature for "Cnam-LMSSC/vibravox".
EDIT: Upgrading to 2.19.0 fixed my problem, thanks to this PR.