distilabel
select num_examples from HF dataset
In some cases, e.g. testing your pipeline before running it, one would like to select only a couple of examples from the HF dataset loaded in src.distilabel.steps.generators.huggingface.LoadHubDataset.
Therefore I offer two ways of selecting num_examples from an HF dataset:
- either quite straightforwardly via
self._dataset.select(range(self.num_examples)):
https://github.com/rasdani/distilabel/blob/d70076df8b360d4bb09a713af46578669d3c3d54/src/distilabel/steps/generators/huggingface.py#L122
- or by modifying the already existing
num_examples. I'm not sure, though, whether this has any downstream effects on batching/yielding:
https://github.com/rasdani/distilabel/blob/d70076df8b360d4bb09a713af46578669d3c3d54/src/distilabel/steps/generators/huggingface.py#L137
We could change num_examples inside _get_dataset_num_examples(self), too.
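For illustration, here is a minimal sketch of the first option. The helper name first_n_indices is hypothetical and not part of distilabel; it only builds the index range that would be passed to select() on a map-style datasets.Dataset, clamped so that asking for more rows than exist does not produce out-of-range indices.

```python
from typing import Optional


def first_n_indices(dataset_len: int, num_examples: Optional[int]) -> range:
    """Return the indices of the first `num_examples` rows.

    Hypothetical helper: the range is clamped to the dataset length, and
    `None` means "keep every row".
    """
    if num_examples is None:
        return range(dataset_len)
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    return range(min(num_examples, dataset_len))


# With a map-style `datasets.Dataset` this would be used roughly as:
#   dataset = dataset.select(first_n_indices(len(dataset), num_examples))
# A plain list stands in for the dataset here so the sketch runs anywhere.
rows = ["a", "b", "c", "d", "e"]
subset = [rows[i] for i in first_n_indices(len(rows), 2)]
print(subset)
```

Clamping is a design choice of this sketch; the PR as linked passes range(self.num_examples) directly.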
I just noticed that self._dataset is an IterableDataset, so .select() is not supported.
Then probably some version of the latter implementation?
Hi @rasdani, thanks for the PR!
Yes, IterableDataset doesn't allow selecting... I think it would make sense to add streaming as a RuntimeParameter too, along with this logic: if streaming == False, we can use select with num_examples.
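A rough sketch of that branching, under stated assumptions: limit_examples is a hypothetical function, not distilabel's implementation, and the two Fake classes are stand-ins that mimic only the methods used here (select() and len() on datasets.Dataset, take() on datasets.IterableDataset) so the example runs without downloading anything.

```python
from itertools import islice
from typing import Any, Iterable, Optional


class FakeDataset:
    """Stand-in for a map-style `datasets.Dataset` (supports len and select)."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self.rows = list(rows)

    def __len__(self) -> int:
        return len(self.rows)

    def select(self, indices: Iterable[int]) -> "FakeDataset":
        return FakeDataset(self.rows[i] for i in indices)


class FakeIterableDataset:
    """Stand-in for a streaming `datasets.IterableDataset` (supports take)."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)

    def take(self, n: int) -> "FakeIterableDataset":
        return FakeIterableDataset(list(islice(iter(self.rows), n)))


def limit_examples(dataset: Any, num_examples: Optional[int], streaming: bool) -> Any:
    """Keep only the first `num_examples` rows, depending on `streaming`."""
    if num_examples is None:
        return dataset  # no limit requested
    if streaming:
        # Iterable datasets have no random access, so take the first n lazily.
        return dataset.take(num_examples)
    # Map-style datasets support index-based selection.
    return dataset.select(range(min(num_examples, len(dataset))))


print(limit_examples(FakeDataset(range(10)), 3, streaming=False).rows)
print(list(limit_examples(FakeIterableDataset(range(10)), 3, streaming=True)))
```

With the real library, the same function body should work unchanged on the objects returned by load_dataset, since it relies only on select(), take(), and len().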
Hi @rasdani, indeed this is already in develop, completed recently by @plaguss in https://github.com/argilla-io/distilabel/pull/565