load_dataset ignores cached datasets and tries to hit HF Hub, resulting in API rate limit errors
Describe the bug
I have been running lm-eval-harness a lot, which has resulted in hitting an API rate limit. This seems strange, since all of the data should be cached locally. I have in fact verified that it is.
Steps to reproduce the bug
- Be Me
- Run `load_dataset("TAUR-Lab/MuSR")`
- Hit rate limit error
- Dataset is in `.cache/huggingface/datasets`
- ???
Expected behavior
We should not run into API rate limits if we have cached the dataset
Environment info
- `datasets` 2.16.0
- Python 3.10.4
I'm having the same issue: I run into rate limits during hyperparameter tuning even though the dataset is supposed to be cached. This behaviour should at the very least be documented, but honestly `load_dataset` should just not hit rate limits in the first place when the dataset is cached. It even happens when a specific revision is pinned, in which case AFAIK there should be no reason to make API requests at all for an already-cached dataset (besides maybe a quick hash check, but hitting rate limits on that with ~200 requests across 10 hours of use seems a bit ridiculous).
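As a stopgap while this is open, one workaround that has worked for me (assuming the dataset is already fully cached; the exact behaviour may vary across `datasets` versions) is to force offline mode through environment variables so everything resolves from the local cache:

```python
import os

# These must be set before datasets / huggingface_hub are imported.
# With offline mode on, the libraries skip Hub requests entirely and
# read from the local cache, failing fast if something is missing.
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"

# from datasets import load_dataset
# ds = load_dataset("TAUR-Lab/MuSR")  # served from cache, no API calls
```

The obvious downside is that any dataset (or revision) not already in the cache will fail to load until you unset the variables.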
I was running into the same issue and solved it by upgrading huggingface_hub from 0.33.5 to 0.34.0.
> python -m pip install huggingface_hub==0.34.0
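For anyone unsure which version their environment is actually picking up (the helper name below is mine), the standard library's `importlib.metadata` is enough to check, without importing the package itself:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_hub_version() -> str:
    """Return the installed huggingface_hub version, or a placeholder if absent."""
    try:
        return version("huggingface_hub")
    except PackageNotFoundError:
        return "not installed"

print(installed_hub_version())
```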