llm-foundry
Allow multiprocessing when preparing ICL dataset
🚀 Feature Request
Allow passing a `num_proc`/`num_workers` parameter to `InContextLearningDataset` so that dataset preparation can use more than one process.
Motivation
When loading larger ICL eval datasets, it is desirable to pass `num_proc > 1` to the following `map` call, which prepares each example in the dataset:
https://github.com/mosaicml/llm-foundry/blob/5571101a50804406ef0fe23e7ea6795b3c4a1bcb/llmfoundry/eval/datasets/in_context_learning_evaluation.py#L173-L181
Can we introduce a `num_proc` parameter in the `InContextLearningDataset` constructor so that example preparation can instead be done like this:
```python
self.dataset: HFDataset = self.dataset.map(
    self._prep_example,
    with_indices=True,
    num_proc=num_proc,
    fn_kwargs={
        'num_fewshot': num_fewshot,
        'prompt_string': prompt_string,
        'fewshot_rng': fewshot_rng,
    },
)
```
This greatly increases the speed of loading larger datasets.
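For reference, Hugging Face `datasets.Dataset.map` already accepts `num_proc`, so the change is mostly plumbing the parameter through. As a standalone illustration of the principle (not llm-foundry code), here is a minimal sketch of per-example preparation fanned out across `num_proc` worker processes using only the standard library; `_prep_example` here is a hypothetical stand-in for the dataset's real preparation logic:

```python
# Illustrative sketch only: parallel per-example prep, analogous to
# passing num_proc to datasets.map. Not the llm-foundry implementation.
from multiprocessing import Pool


def _prep_example(example):
    # Hypothetical stand-in for InContextLearningDataset._prep_example
    # (the real method tokenizes and formats fewshot prompts).
    return {'text': example['text'].strip().lower()}


def prep_dataset(examples, num_proc=1):
    """Prepare every example, using num_proc worker processes when > 1."""
    if num_proc <= 1:
        return [_prep_example(ex) for ex in examples]
    with Pool(processes=num_proc) as pool:
        return pool.map(_prep_example, examples)


if __name__ == '__main__':
    data = [{'text': '  Hello '}, {'text': 'World  '}]
    print(prep_dataset(data, num_proc=2))
```

The single-process path is kept as the default so behavior is unchanged unless the caller opts in, mirroring how `num_proc` defaults work in `datasets.map`.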