lm-evaluation-harness
lm-evaluation-harness copied to clipboard
Allow Task objects to defer dataset download
currently Task object initialization causes the tasks to download their datasets upon initialization.
This is not always desirable, so we should allow users to defer download and perform it manually.
Should add a defer_download: bool = False flag to Task and ConfigurableTask init() methods which, when set to true, has the dataset not downloaded, and allow users to easily call a task.download() method with no args that performs the download.
If users attempt to run the task without first downloading the dataset, we should raise an error.
Hi @haileyschoelkopf
I have started looking into this. I am confused as to why even though the Task class has a download method:
https://github.com/EleutherAI/lm-evaluation-harness/blob/86319a9b14ddae2030bc6e0fdddd47fc7d0bb525/lm_eval/api/task.py#L236-L240
the ConfigurableTask class is the only class which inherits from it and overrides the download method. Wouldn't it be simpler to just define the download method in the Task class as follows:
def download(
self,
data_dir: Optional[str] = None,
cache_dir: Optional[str] = None,
download_mode=None,
**kwargs, # <--- allow for additional kwargs
) -> None:
"""Downloads and returns the task dataset.
Override this method to download the dataset from a custom API.
:param data_dir: str
Stores the path to a local folder containing the `Task`'s data files.
Use this to specify the path to manually downloaded data (usually when
the dataset is not publicly accessible).
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
self.dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
data_dir=data_dir,
cache_dir=cache_dir,
download_mode=download_mode,
**kwargs, # <--- pass the additional kwargs
)
And remove the download method in the ConfigurableTask.
What do you think?
Yes, I think these should be unified.
One thing to check is that despite passing cache_dir directly in the Task.download() method, the HF environment variables should take precedence.
Any progress on this? it is so inconvenient to pass local data for evaluation.