Cannot use cached dataset without Internet connection (or when servers are down)
Describe the bug
I want to be able to use a cached dataset from HuggingFace even when I have no Internet connection (or when the HuggingFace servers are down, or my company has network issues).
Why this currently fails:
The data_files argument of datasets.load_dataset() is updated from the server before the hash used for caching is calculated. As a result, when I run the same code with and without an Internet connection, I get a different dataset configuration directory name.
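A toy illustration of the effect, assuming the directory name is derived from a hash of the data_files value and that the online value embeds the commit SHA fetched from the Hub. The `config_hash` helper is hypothetical and built on hashlib; it is not the hashing code that datasets actually uses.

```python
import hashlib
import json


def config_hash(data_files):
    """Hypothetical stand-in for the cache-directory hash (not the datasets implementation)."""
    payload = json.dumps(data_files, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# What the caller passes in: a relative file pattern.
original = {"train": ["en/c4-train.00000-of-01024.json.gz"]}

# What the loader holds after contacting the Hub: a URL pinned to a commit SHA.
resolved = {
    "train": [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2"
        "/en/c4-train.00000-of-01024.json.gz"
    ]
}

online_dir = config_hash(resolved)   # run with a connection
offline_dir = config_hash(original)  # run without one: nothing can be resolved

# Different inputs give different names, so the offline run looks in the wrong place.
print(online_dir, offline_dir, online_dir != offline_dir)
```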
Steps to reproduce the bug
import datasets
c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)
- Run this code with an Internet connection.
- Run the same code without an Internet connection.
Expected behavior
When running without an Internet connection, the loader should be able to get the dataset from the cache.
Environment info
- datasets version: 2.19.0
- Platform: Windows-10-10.0.19044-SP0
- Python version: 3.10.13
- huggingface_hub version: 0.22.2
- PyArrow version: 16.0.0
- Pandas version: 1.5.3
- fsspec version: 2023.12.2
There are two workarounds, though:
- Download the datasets from the web and load them locally
- Use the metadata directly (a temporary solution, since the metadata can change)
import datasets
from datasets.data_files import DataFilesDict, DataFilesList
data_files_list = DataFilesList(
    [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz"
    ],
    [("allenai/c4", "1588ec454efa1a09f29cd18ddd04fe05fc8653a2")],
)
data_files = DataFilesDict({"train": data_files_list})
c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files=data_files,
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)
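For comparison, the first workaround (download the file once, then load it from disk) might look roughly like the sketch below. It assumes huggingface_hub is installed, that the shard is JSON Lines so the generic "json" builder can read it, and it uses a hypothetical target folder `/datasets/local/c4`. The imports sit inside the functions so the download step, the only one that needs a connection, can be run separately.

```python
from pathlib import Path

REPO_ID = "allenai/c4"
SHARD = "en/c4-train.00000-of-01024.json.gz"
LOCAL_ROOT = Path("/datasets/local/c4")  # hypothetical location, pick your own


def download_shard(root=LOCAL_ROOT):
    """Run once while online: fetch the shard into a plain local folder."""
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=SHARD,
        repo_type="dataset",
        local_dir=str(root),
    )


def load_shard_offline(root=LOCAL_ROOT):
    """Run any time afterwards: read the local file with the generic JSON builder."""
    import datasets

    local_file = root / SHARD
    return datasets.load_dataset(
        "json",
        data_files={"train": str(local_file)},
        split="train",
    )
```

Because the second call only sees a local path, there is no remote revision to resolve, which is why this route is not affected by the changing hash.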
The second workaround also shows where the bug is. I suggest that the hashing functions always use only the original data_files parameter, not the one obtained after connecting to the server and creating a DataFilesDict.
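A rough sketch of that suggestion: normalize whatever the caller passed as data_files and hash only that, so the result never depends on a server lookup. `normalize_data_files` and `stable_config_hash` are hypothetical helpers, and placing a bare string or list under the "train" split is an assumption about the loader's default; none of this is the library's own code.

```python
import hashlib
import json


def normalize_data_files(data_files):
    """Turn str / list / dict input into a canonical {split: [patterns]} dict."""
    if isinstance(data_files, str):
        data_files = {"train": [data_files]}
    elif isinstance(data_files, (list, tuple)):
        data_files = {"train": list(data_files)}
    return {
        split: [patterns] if isinstance(patterns, str) else list(patterns)
        for split, patterns in data_files.items()
    }


def stable_config_hash(data_files):
    """Hash the caller's patterns only; no commit SHA or resolved URL is involved."""
    canonical = json.dumps(normalize_data_files(data_files), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


pattern = "en/c4-train.00000-of-01024.json.gz"
# The same request, spelled three ways, maps to one cache key, online or offline.
assert stable_config_hash(pattern) == stable_config_hash([pattern])
assert stable_config_hash(pattern) == stable_config_hash({"train": pattern})
```

The trade-off is that a key built this way cannot tell when the remote files have changed, so a scheme like this would probably need the revision to be pinned or tracked separately.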
Hi! You need to set the HF_DATASETS_OFFLINE env variable to 1 to load cached datasets offline, as explained in the docs here.
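A minimal sketch of setting the variable from Python, assuming it is read when datasets is first imported, so it has to be set before that import (or exported in the shell before starting Python):

```python
import os

# Set this before `datasets` is imported (assumption: the value is read at import time).
os.environ["HF_DATASETS_OFFLINE"] = "1"

# import datasets  # import only after the variable is set
```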
Just tested. It doesn't work, because of the exact problem I described above: the hash of the dataset config is different.
The only difference in the error is the stated reason it cannot connect to HuggingFace (now it's 'offline mode is enabled').
I ran into a very similar issue: I manually placed the dataset in ~/.cache and expected load_dataset to detect it automatically, but it always tries to reach the Hub even when I set HF_DATASETS_OFFLINE to 1. Have you solved it?