Cannot use cached dataset without Internet connection (or when servers are down)

Open DionisMuzenitov opened this issue 1 year ago • 4 comments

Describe the bug

I want to be able to use a cached dataset from HuggingFace even when I have no Internet connection (or when the HuggingFace servers are down, or my company has network issues).
The reason I can't: the data_files argument of datasets.load_dataset() is updated from the server before the hash used for caching is computed. As a result, running the same code with and without an Internet connection produces different dataset configuration directory names.

Steps to reproduce the bug

import datasets

c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)
  1. Run this code with an Internet connection.
  2. Run the same code without an Internet connection.

Expected behavior

When running without an Internet connection, the loader should be able to load the dataset from the cache.

Environment info

  • datasets version: 2.19.0
  • Platform: Windows-10-10.0.19044-SP0
  • Python version: 3.10.13
  • huggingface_hub version: 0.22.2
  • PyArrow version: 16.0.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.12.2

DionisMuzenitov avatar Apr 25 '24 10:04 DionisMuzenitov

There are two workarounds, though:

  1. Download the dataset files from the web and load them locally.
  2. Use the resolved metadata directly (a temporary solution, since the metadata can change):
import datasets
from datasets.data_files import DataFilesDict, DataFilesList

data_files_list = DataFilesList(
    [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz"
    ],
    [("allenai/c4", "1588ec454efa1a09f29cd18ddd04fe05fc8653a2")],
)
data_files = DataFilesDict({"train": data_files_list})
c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files=data_files,
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)
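
The first workaround can be sketched roughly as follows. This is only an illustration under assumptions: `local_dir` is a hypothetical folder into which the shard was downloaded beforehand, and the generic "json" builder is assumed to be suitable because the shard is gzip-compressed JSON lines.

```python
from pathlib import Path


def load_local_c4_shard(local_dir):
    """Load a previously downloaded C4 shard from disk, without contacting the Hub.

    `local_dir` is a hypothetical directory; the relative path below mirrors
    the layout of the allenai/c4 repository used earlier in this thread.
    """
    shard = Path(local_dir) / "en" / "c4-train.00000-of-01024.json.gz"
    if not shard.is_file():
        # Fail early with a clear message instead of letting the loader guess.
        raise FileNotFoundError(f"Expected a local copy of the shard at {shard}")

    # Imported lazily so the path check above works even without `datasets`.
    import datasets

    # Assumption: the packaged "json" builder can read this shard directly.
    return datasets.load_dataset(
        "json",
        data_files={"train": str(shard)},
        split="train",
    )
```

Because only a local path is passed, nothing in `data_files` has to be resolved against the server.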

The second workaround also shows where the bug is. I suggest that the hashing functions always use only the original data_files parameter, not the DataFilesDict they get after connecting to the server.
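
To illustrate the suggestion, here is a simplified sketch. It is not the library's actual hashing code: `config_hash` is a hypothetical stand-in, and it only shows why a hash over the user-supplied patterns stays stable while a hash over server-resolved URLs (which embed a commit SHA) does not match it.

```python
import hashlib


def config_hash(data_files):
    """Hypothetical stand-in for the config hash; NOT the real datasets implementation."""
    normalized = []
    for split, files in sorted(data_files.items()):
        # Accept either a single pattern or a list of patterns per split.
        if isinstance(files, str):
            files = [files]
        normalized.append((split, tuple(files)))
    return hashlib.sha256(repr(normalized).encode("utf-8")).hexdigest()[:16]


# What the user passes to load_dataset(); identical online and offline.
original = {"train": "en/c4-train.00000-of-01024.json.gz"}

# What the patterns look like after being resolved against the server.
resolved = {
    "train": [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2"
        "/en/c4-train.00000-of-01024.json.gz"
    ]
}

stable_hash = config_hash(original)    # reproducible without a connection
resolved_hash = config_hash(resolved)  # only obtainable after contacting the server
```

In this sketch, hashing `original` always yields the same directory name, whereas `resolved_hash` differs from it and cannot be recomputed offline.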

DionisMuzenitov avatar Apr 25 '24 10:04 DionisMuzenitov

Hi! You need to set the HF_DATASETS_OFFLINE env variable to 1 to load cached datasets offline, as explained in the docs here.
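
A minimal sketch of setting the variable from Python. The helper name `enable_offline_mode` is hypothetical, and also setting `HF_HUB_OFFLINE` for huggingface_hub is an assumption rather than something stated in this thread. The variables are set before importing datasets, on the assumption that the library reads them at import time.

```python
import os


def enable_offline_mode():
    """Hypothetical helper: set the offline flags in this process's environment."""
    os.environ["HF_DATASETS_OFFLINE"] = "1"
    # Assumption: huggingface_hub has its own, separate offline switch.
    os.environ["HF_HUB_OFFLINE"] = "1"


enable_offline_mode()

# Import `datasets` only after this point so it sees the flags, e.g.:
# import datasets
```

The same effect can be had by exporting the variables in the shell before starting Python.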

mariosasko avatar Apr 26 '24 13:04 mariosasko

Just tested. It doesn't work, because of the exact problem I described above: the hash of the dataset config is different. The only difference in the error is the reason given for not connecting to HuggingFace (now it's 'offline mode is enabled').

DionisMuzenitov avatar Apr 26 '24 14:04 DionisMuzenitov

I ran into a very similar issue here: I manually placed the dataset in ~/.cache and expected load_dataset to detect it automatically, but it always tries to reach the Hub, even when I set HF_DATASETS_OFFLINE to 1. Have you solved it?

ErikaaWang avatar Jul 19 '24 08:07 ErikaaWang