Datasets created with `push_to_hub` can't be accessed in offline mode
Describe the bug
In offline mode, one can still access previously-cached datasets. This fails with datasets created with `push_to_hub`.
Steps to reproduce the bug
in Python:
import datasets
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")
in bash:
export HF_DATASETS_OFFLINE=1
in Python:
import datasets
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")
Expected results
`datasets` should find the previously-cached dataset.
Actual results
`ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'teven/matched_passages_wikidata': Offline mode is enabled`
Environment info
- `datasets` version: 1.16.2.dev0
- Platform: Linux-4.18.0-193.70.1.el8_2.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.10
- PyArrow version: 3.0.0
Thanks for reporting. I think this can be fixed by improving the `CachedDatasetModuleFactory` and making it look into the `parquet` cache directory (datasets from `push_to_hub` are loaded with the `parquet` dataset builder). I'll look into it.
Hi, I'm having the same issue. Is there any update on this?
We haven't had a chance to fix this yet. If someone would like to give it a try I'd be happy to give some guidance
@lhoestq Do you have an idea of what changes need to be made to `CachedDatasetModuleFactory`? I would be willing to take a crack at it. I'm currently unable to train with datasets I have pushed with `push_to_hub` on a cluster whose compute nodes are not connected to the internet.
It looks like it might be this line:
https://github.com/huggingface/datasets/blob/0c1d099f87a883e52c42d3fd1f1052ad3967e647/src/datasets/load.py#L994
which wouldn't pick up the stuff saved under `datasets/allenai___parquet/*`. Additionally, the datasets saved under `datasets/allenai___parquet/*` appear to have hashes in their name, e.g. `datasets/allenai___parquet/my_dataset-def9ee5552a1043e`. This would not be detected by `CachedDatasetModuleFactory`, which currently looks for subdirectories here:
https://github.com/huggingface/datasets/blob/0c1d099f87a883e52c42d3fd1f1052ad3967e647/src/datasets/load.py#L995-L999
`importable_directory_path` is used to find a dataset script that was previously downloaded and cached from the Hub.
However, in your case there's no dataset script on the Hub, only parquet files. So the logic must be extended for this case.
In particular, I think you can add new logic for the case where `hashes` is None (i.e. if there's no dataset script associated with the dataset in the cache).
In this case you can check directly in the datasets cache for a directory named `<namespace>___parquet` and a subdirectory named `<config_id>`. The config_id must match `{self.name.replace("/", "--")}-*`.
In your case those two directories correspond to `allenai___parquet` and then `allenai--my_dataset-def9ee5552a1043e`.
Then you can find the most recent version of the dataset in subdirectories (e.g. by sorting on the last modified time of the `dataset_info.json` file).
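For illustration, a minimal sketch of that lookup; this is not the actual implementation, and the helper name and the exact `<config_id>/<version>/<hash>` nesting are assumptions based on the directories mentioned in this thread:

```python
# Hypothetical helper, not part of `datasets`: locate the most recent cached
# parquet export of a Hub dataset, assuming the cache layout described above.
import glob
import os


def find_cached_parquet_info(name: str, cache_dir: str):
    namespace, _ = name.split("/")
    # e.g. <cache_dir>/allenai___parquet
    namespace_dir = os.path.join(cache_dir, f"{namespace}___parquet")
    # config dirs look like "allenai--my_dataset-<hash>"
    config_dirs = glob.glob(os.path.join(namespace_dir, name.replace("/", "--") + "-*"))
    # dataset_info.json is assumed to sit under <config_id>/<version>/<hash>/
    info_files = [
        f
        for d in config_dirs
        for f in glob.glob(os.path.join(d, "*", "*", "dataset_info.json"))
    ]
    # keep the most recently modified version
    return max(info_files, key=os.path.getmtime) if info_files else None
```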
Finally, we will need to return the module that is used to load the dataset from the cache. It is the same module as the one that would normally have been used if you had an internet connection.
At that point you can ping me, because we will need to pass all of this (a rough sketch follows after the list):
- `module_path = _PACKAGED_DATASETS_MODULES["parquet"][0]`
- `hash`: it corresponds to the name of the directory that contains the .arrow file, inside `<namespace>___parquet/<config_id>`
- `builder_kwargs = {"hash": hash, "repo_id": self.name, "config_id": config_id}`, and currently `config_id` is not a valid argument for a `DatasetBuilder`
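As a hedged sketch of what that return could look like once the cached `config_id` and `hash` have been found (e.g. with the lookup sketched above); the function name is made up and the exact `DatasetModule` signature may differ between `datasets` versions:

```python
# Hypothetical sketch, not the actual implementation: build the dataset module
# for a cached parquet export using the packaged "parquet" builder.
from datasets.load import DatasetModule
from datasets.packaged_modules import _PACKAGED_DATASETS_MODULES


def parquet_module_from_cache(repo_id: str, config_id: str, hash: str) -> DatasetModule:
    module_path, _ = _PACKAGED_DATASETS_MODULES["parquet"]
    builder_kwargs = {
        "hash": hash,            # name of the directory that contains the .arrow file
        "repo_id": repo_id,      # e.g. "allenai/my_dataset"
        "config_id": config_id,  # e.g. "allenai--my_dataset-def9ee5552a1043e"
    }
    return DatasetModule(module_path, hash, builder_kwargs)
```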
I think in the future we want to change this caching logic completely, since I don't find it super easy to play with.
Hi! Is there a workaround for the time being?
Like passing `data_dir` or something like that?
I would like to use this diffusers example on my cluster whose nodes are not connected to the internet. I have downloaded the dataset online from the login node.
Hi ! Yes you can save your dataset locally with `my_dataset.save_to_disk("path/to/local")` and reload it later with `load_from_disk("path/to/local")`.
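For example, using the dataset from this issue (the local path is just a placeholder):

```python
# On a machine with internet access (e.g. the login node): download and save locally.
from datasets import load_dataset, load_from_disk

dataset = load_dataset("teven/matched_passages_wikidata")
dataset.save_to_disk("path/to/local")

# Later, on the offline compute node: reload from disk, no Hub access needed.
dataset = load_from_disk("path/to/local")
```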
(removing myself from assignees since I'm currently not working on this right now)
Still not fixed? ......
Any idea @lhoestq who to tag to fix this? This is a very annoying bug, which is becoming more and more common since the `push_to_hub` API is getting used more and more. Perhaps @mariosasko? Thanks a lot for the great work on the lib!
It should be easier to implement now that we improved the caching of datasets from `push_to_hub`: each dataset has its own directory in the cache.
The cache structure has been improved in https://github.com/huggingface/datasets/pull/5331. Now the cache structure is `{namespace___}<dataset_name>/<config_name>/<version>/<hash>/`, which contains the arrow files `<dataset_name>-<split>.arrow` and `dataset_info.json`.
The idea is to extend `CachedDatasetModuleFactory` to also check if this directory exists in the cache (in addition to the already existing cache check) and return the requested dataset module. The module name can be found in the JSON file, in the `builder_name` field.
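A rough sketch of that check, assuming the cache layout quoted above and that the builder name can be read from `dataset_info.json`; the helper name and exact globbing are made up, not the real implementation:

```python
# Hypothetical helper: look up a dataset saved with push_to_hub in the local
# cache and return the name of the packaged builder to load it with.
import glob
import json
import os


def cached_builder_name(namespace: str, dataset_name: str, cache_dir: str):
    # assumed cache layout: {namespace___}<dataset_name>/<config_name>/<version>/<hash>/
    dataset_dir = os.path.join(cache_dir, f"{namespace}___{dataset_name}")
    info_files = glob.glob(os.path.join(dataset_dir, "*", "*", "*", "dataset_info.json"))
    if not info_files:
        return None  # nothing cached for this dataset
    with open(max(info_files, key=os.path.getmtime), encoding="utf-8") as f:
        return json.load(f)["builder_name"]  # e.g. "parquet"
```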
Any progress?
I started a PR to draft the logic to reload datasets from the cache if they were created with `push_to_hub`: https://github.com/huggingface/datasets/pull/6459
Feel free to try it out
It seems that this does not support datasets with uppercase names
Which version of `datasets` are you using? This issue has been fixed in `datasets` 2.16.
I can confirm that this problem is still happening with `datasets` 2.17.0, installed from pip.
Can you share code or a dataset that reproduces the issue? It seems to work fine on my side.
Yeah, `dataset = load_dataset("roneneldan/TinyStories")`
I tried it with `dataset = load_dataset("roneneldan/tinystories")` and it worked.
> It seems that this does not support datasets with uppercase names

@fecet was right, but if you just write the name in lowercase, it works.