
Allow multiprocessing when preparing ICL dataset

Open sanjari-orb opened this issue 8 months ago • 8 comments

🚀 Feature Request

Allow passing a num_proc/num_workers parameter in InContextLearningDataset so that dataset preparation can use more than one process.

Motivation

When loading larger ICL eval datasets, it is desirable to pass num_proc>1 to the following map call, which prepares each example in the dataset: https://github.com/mosaicml/llm-foundry/blob/5571101a50804406ef0fe23e7ea6795b3c4a1bcb/llmfoundry/eval/datasets/in_context_learning_evaluation.py#L173-L181 Can we introduce a num_proc parameter in the InContextLearningDataset constructor so that example preparation can instead be done like this:

        self.dataset: HFDataset = self.dataset.map(
            self._prep_example,
            with_indices=True,
            num_proc=num_proc,
            fn_kwargs={
                'num_fewshot': num_fewshot,
                'prompt_string': prompt_string,
                'fewshot_rng': fewshot_rng,
            },
        )

This greatly speeds up loading larger datasets.
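To illustrate the idea outside of llm-foundry, here is a minimal stdlib sketch of the same pattern: a per-example prep function mapped over a dataset either serially or with a worker pool, gated by a num_proc parameter. The prep_example function below is a hypothetical stand-in for InContextLearningDataset._prep_example, not the actual implementation.

```python
from multiprocessing import Pool


def prep_example(args):
    # Hypothetical stand-in for _prep_example: formats one
    # (index, example) pair into a prompt dict.
    idx, example = args
    return {'prompt': f"Q: {example['question']}\nA:", 'index': idx}


def prep_dataset(examples, num_proc=1):
    """Prepare all examples, in parallel when num_proc > 1."""
    items = list(enumerate(examples))
    if num_proc <= 1:
        return [prep_example(a) for a in items]
    # Fan the per-example work out across num_proc worker processes,
    # mirroring what datasets.Dataset.map(..., num_proc=...) does.
    with Pool(processes=num_proc) as pool:
        return pool.map(prep_example, items)


if __name__ == '__main__':
    data = [{'question': f'q{i}'} for i in range(4)]
    print(prep_dataset(data, num_proc=2))
```

Because each example is prepared independently, the result is identical regardless of num_proc; only wall-clock time changes, which is why exposing the parameter is a low-risk change.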

sanjari-orb commented Jun 13 '24 00:06