
Shard generator

marianna13 opened this issue 3 years ago • 5 comments

Hi everyone! I was using Hugging Face datasets to process some very large datasets and found that it would be quite handy to have a feature that allows "splitting" these large datasets into chunks of equal size. Even better: being able to run through these chunks one by one in a simple and convenient way. So I decided to add a method called shard_generator() to the main Dataset class. It works similarly to the shard method, but it returns a generator of datasets of equal size (defined by the shard_size argument). Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds
Dataset({
      features: ['text', 'label'],
      num_rows: 1066
})
>>> next(ds.shard_generator(300))
Dataset({
      features: ['text', 'label'],
      num_rows: 300
})

I hope it can be helpful to someone. Thanks!
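
(For illustration only, here is a minimal sketch of what such a generator could look like on top of the existing select API; the name and signature mirror the proposal above, but this is not necessarily the PR's actual implementation.)

from datasets import Dataset

def shard_generator(ds: Dataset, shard_size: int):
    # Yield consecutive sub-datasets with at most `shard_size` rows each (illustrative sketch)
    for start in range(0, len(ds), shard_size):
        end = min(start + shard_size, len(ds))
        yield ds.select(range(start, end))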

marianna13 avatar Aug 06 '22 09:08 marianna13

Hi, thanks!

I was using Hugging Face datasets to process some very large datasets and found that it would be quite handy to have a feature that allows "splitting" these large datasets into chunks of equal size

map, the method we use for processing in datasets, already does that if batched=True. And you can control the batch size with batch_size.
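
For reference, a minimal sketch of that pattern, reusing the rotten_tomatoes example from above (the lowercasing function is just a placeholder for real processing):

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="validation")

def lowercase(batch):
    # `batch` is a dict of lists containing at most `batch_size` rows
    batch["text"] = [text.lower() for text in batch["text"]]
    return batch

ds = ds.map(lowercase, batched=True, batch_size=300)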

Even better: being able to run through these chunks one by one in a simple and convenient way

It's not hard to do this "manually" with the existing API:

from datasets import Dataset

batch_size = <BATCH_SIZE>
for i in range(len(dset) // batch_size):  # note: drops the final partial batch
    shard = dset[i * batch_size:(i + 1) * batch_size]  # a dict of lists (in memory)
    shard = Dataset.from_dict(shard)

(should be of similar performance to your implementation)

Still, I think an API like that could be useful if implemented efficiently (see this discussion to understand the issue with select/__getitem__, which your implementation relies on), which can be done with pa.Table.to_reader in PyArrow 8.0.0+.
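
For illustration, a rough sketch of the kind of chunked iteration to_reader enables (the toy table below is made up):

import pyarrow as pa

table = pa.table({"text": ["a", "b", "c", "d", "e"], "label": [0, 1, 0, 1, 0]})

# to_reader yields record batches of at most `max_chunksize` rows each,
# without copying the whole table into new buffers
for batch in table.to_reader(max_chunksize=2):
    print(batch.num_rows)  # 2, 2, 1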

@lhoestq @albertvillanova wdyt? We could use such an API to efficiently iterate over the batches in map before processing them.

mariosasko avatar Aug 12 '22 15:08 mariosasko


This is more efficient since it doesn't bring the data in memory:

num_batches = (len(dset) + batch_size - 1) // batch_size  # round up so the final partial shard is included
for i in range(num_batches):
    start = i * batch_size
    end = min((i + 1) * batch_size, len(dset))
    # select only references these rows in the underlying Arrow table,
    # so the shard is not loaded into memory
    shard = dset.select(range(start, end))

@marianna13 can you give more details on when it would be handy to have this shard generator?

lhoestq avatar Aug 18 '22 15:08 lhoestq

@marianna13 can you give more details on when it would be handy to have this shard generator?

Sure! I used such a generator when I needed to process a very large dataset (>1TB) in parallel. I found empirically that it's much more efficient to process only one part of the dataset at a time with the shard generator. I tried to use map with batching, but it caused OOM errors, so I tried the regular shard method instead, and this is what I came up with. I thought it might be helpful to someone else!
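
As a rough illustration of that workflow (not the PR's code; process_fn and the shard count below are placeholders), the existing shard API already lets each worker pick up one piece of the dataset:

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="validation")

def process_fn(batch):
    # placeholder for the real per-batch processing
    return batch

num_shards = 4  # e.g. one shard per worker process or machine
for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index)
    processed = shard.map(process_fn, batched=True)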

marianna13 avatar Aug 18 '22 15:08 marianna13

I see, thanks! map should work just fine even at this scale; feel free to open an issue if you'd like to discuss your OOM issue.

Regarding shard_generator: since it is pretty straightforward to get shards, I'm not sure we need that extra Dataset method.

lhoestq avatar Aug 19 '22 10:08 lhoestq

Hi again! We've just added _iter_batches(batch_size) to the Dataset API for fast iteration over batches/chunks, so I think we can close this PR. Compared to this implementation, _iter_batches leverages pa.Table.to_reader for chunking, which makes it significantly faster.

mariosasko avatar Oct 03 '22 15:10 mariosasko