select num_examples from HF dataset

Open rasdani opened this issue 1 year ago • 2 comments

In some cases, e.g. when testing a pipeline before a full run, one would like to select only a couple of examples from the HF dataset loaded in src.distilabel.steps.generators.huggingface.LoadHubDataset.

Therefore I offer two ways of selecting num_examples from a HF dataset:

  • either pretty straightforwardly with self._dataset.select(range(self.num_examples)):

https://github.com/rasdani/distilabel/blob/d70076df8b360d4bb09a713af46578669d3c3d54/src/distilabel/steps/generators/huggingface.py#L122

  • or by modifying the already existing num_examples. I'm not sure, though, whether this has any downstream effects on batching/yielding:

https://github.com/rasdani/distilabel/blob/d70076df8b360d4bb09a713af46578669d3c3d54/src/distilabel/steps/generators/huggingface.py#L137

We could change num_examples inside _get_dataset_num_examples(self), too.

rasdani avatar Apr 11 '24 14:04 rasdani

I just noticed that self._dataset is an IterableDataset, therefore .select() is not supported.

Then probably some version of the latter implementation?
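To make the batching/yielding concern concrete, here is a minimal sketch of capping the number of rows while still flagging the final batch correctly. It assumes the step yields (batch, is_last_batch) pairs; iter_batches and its parameters are hypothetical names for illustration, not distilabel's actual implementation.

```python
from itertools import islice


def iter_batches(rows, batch_size, num_examples=None):
    """Yield (batch, is_last_batch) pairs from any iterable.

    Hypothetical helper: stops after `num_examples` rows when a cap is given.
    Works for streaming sources because it never indexes into `rows`.
    """
    # Cap the stream lazily; islice never reads past `num_examples` rows.
    it = iter(rows) if num_examples is None else islice(rows, num_examples)
    batch = list(islice(it, batch_size))
    while batch:
        # Peek at the next batch so the current one can be flagged as last.
        next_batch = list(islice(it, batch_size))
        yield batch, not next_batch
        batch = next_batch
```

Because the cap is applied before batching, the last-batch flag is derived from the truncated stream rather than from the full dataset size, so a downstream consumer still sees exactly one final batch.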

rasdani avatar Apr 11 '24 14:04 rasdani

Hi @rasdani, thanks for the PR!

Yes, IterableDataset doesn't support select... I think it would make sense to add streaming as a RuntimeParameter too, with this logic: if streaming == False, we can use select with num_examples.
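A rough sketch of that logic, assuming the dataset object follows the Hugging Face datasets interface (Dataset.select for map-style datasets, IterableDataset.take for streamed ones). The function name limit_examples and its signature are hypothetical, not something that exists in distilabel.

```python
def limit_examples(dataset, num_examples=None, streaming=False):
    """Return `dataset` restricted to its first `num_examples` rows.

    Hypothetical helper; `dataset` is assumed to behave like a Hugging Face
    `Dataset` (streaming=False) or `IterableDataset` (streaming=True).
    """
    if num_examples is None:
        # No cap requested: hand the dataset back unchanged.
        return dataset
    if streaming:
        # Streamed datasets have no select(); take(n) lazily keeps the first n rows.
        return dataset.take(num_examples)
    # Map-style datasets support index-based selection. Clamp the range so a
    # cap larger than the dataset does not request out-of-range indices.
    return dataset.select(range(min(num_examples, len(dataset))))
```

With the real library this would be called on the result of load_dataset(..., streaming=streaming), passing the same streaming flag to both calls.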

gabrielmbmb avatar Apr 15 '24 15:04 gabrielmbmb

Hi @rasdani, indeed this is already in develop; it was completed recently by @plaguss in https://github.com/argilla-io/distilabel/pull/565

alvarobartt avatar May 07 '24 11:05 alvarobartt