
Shard generator

marianna13 opened this issue 3 years ago • 5 comments

Hi everyone! I was using Hugging Face datasets to process some very large datasets and found that it would be quite handy to have a feature that allows "splitting" these large datasets into chunks of equal size. Even better: being able to run through these chunks one by one in a simple and convenient way. So I decided to add a method called shard_generator() to the main Dataset class. It works similarly to the shard method, but it returns a generator of datasets of equal size (defined by the shard_size argument). Example:

>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds
Dataset({
      features: ['text', 'label'],
      num_rows: 1066
})
>>> next(ds.shard_generator(300))
Dataset({
      features: ['text', 'label'],
      num_rows: 300
})

I hope it can be helpful to someone. Thanks!
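
(For illustration only, here is a minimal sketch of what such a generator could look like on top of the existing select API; the name and signature mirror the proposal above, but this is not necessarily the PR's actual implementation.)

from datasets import Dataset

def shard_generator(ds: Dataset, shard_size: int):
    # Yield consecutive sub-datasets with at most `shard_size` rows each (illustrative sketch)
    for start in range(0, len(ds), shard_size):
        end = min(start + shard_size, len(ds))
        yield ds.select(range(start, end))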

marianna13 avatar Aug 06 '22 09:08 marianna13

Hi, thanks!

I was using Hugging Face datasets to process some very large datasets and found that it would be quite handy to have a feature that allows "splitting" these large datasets into chunks of equal size

map, the method we use for processing in datasets, already does that if batched=True. And you can control the batch size with batch_size.
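
For reference, a minimal sketch of that pattern, reusing the rotten_tomatoes example from above (the lowercasing function is just a placeholder for real processing):

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="validation")

def lowercase(batch):
    # `batch` is a dict of lists containing at most `batch_size` rows
    batch["text"] = [text.lower() for text in batch["text"]]
    return batch

ds = ds.map(lowercase, batched=True, batch_size=300)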

Even better: being able to run through these chunks one by one in a simple and convenient way

It's not hard to do this "manually" with the existing API:

from datasets import Dataset

batch_size = <BATCH_SIZE>
for i in range(len(dset) // batch_size):  # note: drops the final partial batch
    shard = dset[i * batch_size:(i + 1) * batch_size]  # a dict of lists (in memory)
    shard = Dataset.from_dict(shard)

(should be of similar performance to your implementation)

Still, I think an API like that could be useful if implemented efficiently (see this discussion to understand the issue with select/__getitem__, which your implementation relies on), which can be done with pa.Table.to_reader in PyArrow 8.0.0+.
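
For illustration, a rough sketch of the kind of chunked iteration to_reader enables (the toy table below is made up):

import pyarrow as pa

table = pa.table({"text": ["a", "b", "c", "d", "e"], "label": [0, 1, 0, 1, 0]})

# to_reader yields record batches of at most `max_chunksize` rows each,
# without copying the whole table into new buffers
for batch in table.to_reader(max_chunksize=2):
    print(batch.num_rows)  # 2, 2, 1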

@lhoestq @albertvillanova wdyt? We could use such an API to efficiently iterate over the batches in map before processing them.

mariosasko avatar Aug 12 '22 15:08 mariosasko


This is more efficient since it doesn't bring the data in memory:

num_batches = (len(dset) + batch_size - 1) // batch_size  # round up so the final partial shard is included
for i in range(num_batches):
    start = i * batch_size
    end = min((i + 1) * batch_size, len(dset))
    # select only references these rows in the underlying Arrow table,
    # so the shard is not loaded into memory
    shard = dset.select(range(start, end))

@marianna13 can you give more details on when it would be handy to have this shard generator?

lhoestq avatar Aug 18 '22 15:08 lhoestq

@marianna13 can you give more details on when it would be handy to have this shard generator?

Sure! I used such a generator when I needed to process a very large dataset (>1TB) in parallel. I found empirically that it's much more efficient to process only one part of the dataset at a time with the shard generator. I tried to use map with batching, but it caused OOM errors, so I tried the regular shard method instead, and this is what I came up with. I thought it might be helpful to someone else!
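
As a rough illustration of that workflow (not the PR's code; process_fn and the shard count below are placeholders), the existing shard API already lets each worker pick up one piece of the dataset:

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="validation")

def process_fn(batch):
    # placeholder for the real per-batch processing
    return batch

num_shards = 4  # e.g. one shard per worker process or machine
for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index)
    processed = shard.map(process_fn, batched=True)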

marianna13 avatar Aug 18 '22 15:08 marianna13

I see, thanks! map should work just fine even at this scale; feel free to open an issue if you'd like to discuss your OOM issue.

Regarding shard_generator: since it is pretty straightforward to get shards, I'm not sure we need that extra Dataset method.

lhoestq avatar Aug 19 '22 10:08 lhoestq

Hi again! We've just added _iter_batches(batch_size) to the Dataset API for fast iteration over batches/chunks, so I think we can close this PR. Compared to this implementation, _iter_batches leverages pa.Table.to_reader for chunking, which makes it significantly faster.

mariosasko avatar Oct 03 '22 15:10 mariosasko