lm-evaluation-harness

Allow Task objects to defer dataset download

Open haileyschoelkopf opened this issue 1 year ago • 3 comments

Currently, initializing a Task object causes it to download its dataset immediately.

This is not always desirable, so we should allow users to defer download and perform it manually.

We should add a defer_download: bool = False flag to the Task and ConfigurableTask __init__() methods. When it is set to True, the dataset is not downloaded during initialization, and users can instead call a task.download() method with no arguments to perform the download themselves.

If users attempt to run the task without first downloading the dataset, we should raise an error.
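A minimal sketch of the proposed behavior, for illustration only. `DeferredTask`, its `loader` argument, and `run()` are hypothetical stand-ins so the example runs offline; they are not the harness's actual classes or methods. Only the `defer_download` flag and the argument-free `download()` come from the proposal above.

```python
from typing import Any, Callable, Optional


class DeferredTask:
    """Hypothetical stand-in for a Task; not the harness's actual class."""

    def __init__(
        self,
        loader: Callable[[], Any],
        defer_download: bool = False,
    ) -> None:
        # `loader` is an assumed hook standing in for the real dataset fetch.
        self._loader = loader
        self.dataset: Optional[Any] = None
        if not defer_download:
            self.download()

    def download(self) -> None:
        """Fetch the dataset; takes no arguments, as proposed."""
        self.dataset = self._loader()

    def run(self) -> int:
        """Placeholder for evaluation; refuses to run without a dataset."""
        if self.dataset is None:
            raise RuntimeError(
                "Dataset not downloaded; call task.download() first."
            )
        return len(self.dataset)


task = DeferredTask(loader=lambda: ["doc-1", "doc-2"], defer_download=True)
try:
    task.run()  # fails: nothing has been downloaded yet
except RuntimeError as err:
    print(err)

task.download()  # manual download
print(task.run())
```

Keeping the default at `False` preserves the current eager behavior, so existing callers would not need to change.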

haileyschoelkopf, Mar 11 '24 16:03

Hi @haileyschoelkopf

I have started looking into this. One thing confuses me: the Task class already has a download method: https://github.com/EleutherAI/lm-evaluation-harness/blob/86319a9b14ddae2030bc6e0fdddd47fc7d0bb525/lm_eval/api/task.py#L236-L240

yet ConfigurableTask, the only class that inherits from it, overrides that method. Wouldn't it be simpler to just define the download method in the Task class as follows:

def download(
        self,
        data_dir: Optional[str] = None,
        cache_dir: Optional[str] = None,
        download_mode=None,
        **kwargs,         # <--- allow for additional kwargs
    ) -> None:
        """Downloads the task dataset and stores it in `self.dataset`.
        Override this method to download the dataset from a custom API.

        :param data_dir: str
            Stores the path to a local folder containing the `Task`'s data files.
            Use this to specify the path to manually downloaded data (usually when
            the dataset is not publicly accessible).
        :param cache_dir: str
            The directory to read/write the `Task` dataset. This follows the
            HuggingFace `datasets` API with the default cache directory located at:
                `~/.cache/huggingface/datasets`
            NOTE: You can change the cache location globally for a given process
            by setting the shell environment variable, `HF_DATASETS_CACHE`,
            to another directory:
                `export HF_DATASETS_CACHE="/path/to/another/directory"`
        :param download_mode: datasets.DownloadMode
            How to treat pre-existing `Task` downloads and data.
            - `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
                Reuse download and reuse dataset.
            - `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
                Reuse download with fresh dataset.
            - `datasets.DownloadMode.FORCE_REDOWNLOAD`
                Fresh download and fresh dataset.
        """
        self.dataset = datasets.load_dataset(
            path=self.DATASET_PATH,
            name=self.DATASET_NAME,
            data_dir=data_dir,
            cache_dir=cache_dir,
            download_mode=download_mode,
            **kwargs,           # <--- pass the additional kwargs
        )

And remove the download method in the ConfigurableTask.
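A rough sketch of how that unification could look. The classes here are simplified, hypothetical versions of Task and ConfigurableTask, and `load_fn` / `fake_load` are assumed stand-ins for `datasets.load_dataset` so the example runs without network access. The `revision` keyword is used only to illustrate an extra argument being forwarded.

```python
from typing import Any, Callable, Dict, Optional


class Task:
    """Hypothetical, simplified base class (not the harness's real Task)."""

    DATASET_PATH: Optional[str] = None
    DATASET_NAME: Optional[str] = None

    def __init__(self, load_fn: Callable[..., Any]) -> None:
        # `load_fn` stands in for datasets.load_dataset (assumption).
        self._load_fn = load_fn
        self.dataset: Any = None

    def download(
        self,
        data_dir: Optional[str] = None,
        cache_dir: Optional[str] = None,
        download_mode: Any = None,
        **kwargs: Any,
    ) -> None:
        # Extra keyword arguments are forwarded untouched to the loader.
        self.dataset = self._load_fn(
            path=self.DATASET_PATH,
            name=self.DATASET_NAME,
            data_dir=data_dir,
            cache_dir=cache_dir,
            download_mode=download_mode,
            **kwargs,
        )


class ConfigurableTask(Task):
    """Inherits download() unchanged; no override is needed."""

    DATASET_PATH = "example/path"  # placeholder value


def fake_load(**kwargs: Any) -> Dict[str, Any]:
    # Echo the arguments back so the forwarding is visible.
    return kwargs


task = ConfigurableTask(load_fn=fake_load)
task.download(cache_dir="/tmp/cache", revision="main")
print(task.dataset)
```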

What do you think?

zafstojano, May 21 '24 08:05

Yes, I think these should be unified.

One thing to check: even though cache_dir is passed directly to the Task.download() method, the HF environment variables should still take precedence.
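One way to make that precedence explicit is a small helper like the one below. `resolve_cache_dir` is a hypothetical name, and the sketch assumes `HF_DATASETS_CACHE` (the variable named in the docstring above) is the one that matters. It encodes the precedence requested here, which is not necessarily what `datasets.load_dataset` does on its own when given an explicit `cache_dir`.

```python
import os
from typing import Mapping, Optional


def resolve_cache_dir(
    cache_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick the cache directory, preferring HF_DATASETS_CACHE when it is set.

    `env` defaults to os.environ; it is a parameter only to ease testing.
    """
    env = os.environ if env is None else env
    from_env = env.get("HF_DATASETS_CACHE")
    # An unset or empty variable falls back to the explicit argument.
    return from_env if from_env else cache_dir


# The environment variable wins over the explicit argument.
print(resolve_cache_dir("/explicit", env={"HF_DATASETS_CACHE": "/from/env"}))
# With no variable set, the explicit argument is used.
print(resolve_cache_dir("/explicit", env={}))
```

The result could then be passed as `cache_dir` to the unified download method, so the check lives in one place.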

haileyschoelkopf, May 22 '24 13:05

Any progress on this? It is very inconvenient to pass local data for evaluation.

hrwise-nlp, Sep 09 '24 06:09