
load_dataset ignores cached datasets and tries to hit HF Hub, resulting in API rate limit errors

Open tginart opened this issue 1 year ago • 2 comments

Describe the bug

I have been running lm-eval-harness a lot, which has resulted in hitting an API rate limit. This seems strange, since all of the data should be cached locally, and I have verified that it is.

Steps to reproduce the bug

  1. Be Me
  2. Run load_dataset("TAUR-Lab/MuSR")
  3. Hit rate limit error
  4. Dataset is in .cache/huggingface/datasets
  5. ???

Expected behavior

We should not run into API rate limits when the dataset is already cached.
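One way to test this expectation is to force offline mode, so that any attempt to reach the Hub fails immediately instead of counting against the rate limit. The sketch below relies on the `HF_HUB_OFFLINE` and `HF_DATASETS_OFFLINE` environment variables; the helper names `enable_offline_mode` and `load_from_cache_only` are made up for illustration, and whether this resolves the problem on datasets 2.16.0 is an assumption that has not been checked here.

```python
import os
from typing import MutableMapping


def enable_offline_mode(
    env: MutableMapping[str, str] = os.environ,
) -> MutableMapping[str, str]:
    """Set the offline switches read by huggingface_hub and datasets."""
    env["HF_HUB_OFFLINE"] = "1"
    env["HF_DATASETS_OFFLINE"] = "1"
    return env


def load_from_cache_only(name: str):
    """Hypothetical helper: load `name` without contacting the Hub."""
    enable_offline_mode()
    # Import only after the variables are set, because both libraries
    # read them when they are first imported.
    from datasets import load_dataset

    return load_dataset(name)
```

Calling `load_from_cache_only("TAUR-Lab/MuSR")` should then either return the cached dataset or raise an error, which at least shows whether the cache alone is sufficient. If `datasets` was already imported earlier in the process, setting the variables this late will likely have no effect; exporting them in the shell before starting Python avoids that.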

Environment info

datasets 2.16.0, Python 3.10.4

tginart, Aug 02 '24 18:08

I'm having the same issue: I run into rate limits during hyperparameter tuning even though the dataset is supposed to be cached. This behaviour should at the very least be documented, but honestly you should not run into rate limits at all when the dataset is cached. It even happens when pinning a specific revision of the dataset, in which case AFAIK there should be no reason to make API requests if it's already cached (besides maybe a quick hash check, but hitting rate limits for that with ~200 requests across 10 hours of use seems a bit ridiculous).
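For the hyperparameter tuning case, a possible mitigation (not a fix for the underlying behaviour) is to call `load_dataset` once per process and reuse the result across trials. The sketch below is an assumption about how such a wrapper could look; `get_dataset` and `_hub_loader` are hypothetical names, and it only helps when the trials run inside the same Python process.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# In-process cache keyed by (dataset name, revision).
_LOADED: Dict[Tuple[str, Optional[str]], Any] = {}


def _hub_loader(name: str, revision: Optional[str]) -> Any:
    """Default loader: defers to datasets.load_dataset."""
    from datasets import load_dataset

    return load_dataset(name, revision=revision)


def get_dataset(
    name: str,
    revision: Optional[str] = None,
    loader: Callable[[str, Optional[str]], Any] = _hub_loader,
) -> Any:
    """Return the dataset, invoking `loader` at most once per (name, revision)."""
    key = (name, revision)
    if key not in _LOADED:
        _LOADED[key] = loader(name, revision)
    return _LOADED[key]
```

Each trial would call `get_dataset(...)` instead of `load_dataset(...)`, so only the first call per process can trigger Hub requests. Trials launched as separate processes would each still make their own first call.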

luisgalan, Jun 16 '25 18:06

I was running into the same issue and solved it by upgrading huggingface_hub from 0.33.5 to 0.34.0.

> python -m pip install huggingface_hub==0.34.0
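To check whether an environment is still on a version older than the one reported to work above, something like the following can be used. It is only a sketch: the 0.34.0 threshold comes from this comment and is not a confirmed fix version, the helper names are made up, and the comparison looks only at the first three numeric components (pre-release tags are ignored).

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract up to three leading numeric components, e.g. '0.33.5' -> (0, 33, 5)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def needs_upgrade(installed: str, minimum: str = "0.34.0") -> bool:
    """True if `installed` is older than `minimum` (threshold is an assumption)."""
    return release_tuple(installed) < release_tuple(minimum)


def installed_hub_version() -> Optional[str]:
    """Return the installed huggingface_hub version, or None if it is missing."""
    try:
        return metadata.version("huggingface_hub")
    except metadata.PackageNotFoundError:
        return None
```

For example, `needs_upgrade(installed_hub_version() or "0")` reports whether the current environment is below 0.34.0.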

yunjae-won, Nov 21 '25 10:11