Shard generator
Hi everyone! I was using Hugging Face datasets to process some very large datasets and found that it would be quite handy to have a feature that allows "splitting" these large datasets into equally sized chunks. Even better: being able to run through these chunks one by one in a simple and convenient way. So I added a method called shard_generator() to the main Dataset class. It works similarly to the shard method, but it returns a generator of datasets of equal size (defined by the shard_size parameter). Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})
>>> next(ds.shard_generator(300))
Dataset({
    features: ['text', 'label'],
    num_rows: 300
})
I hope it can be helpful to someone. Thanks!
Hi, thanks!
I was using Hugging Face datasets to process some very large datasets and found that it would be quite handy to have a feature that allows "splitting" these large datasets into equally sized chunks
map, the method we use for processing in datasets, already does that if batched=True. And you can control the batch size with batch_size.
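For example (an illustrative snippet; process_batch is a placeholder processing function):
def process_batch(batch):
    # `batch` is a dict of lists holding up to `batch_size` rows
    batch["text"] = [t.lower() for t in batch["text"]]
    return batch

ds = ds.map(process_batch, batched=True, batch_size=300)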
Even better: being able to run through these chunks one by one in a simple and convenient way
It's not hard to do this "manually" with the existing API:
from datasets import Dataset

batch_size = <BATCH_SIZE>
for i in range(len(dset) // batch_size):  # note: drops the last partial chunk
    shard = dset[i * batch_size : (i + 1) * batch_size]  # a dict of lists
    shard = Dataset.from_dict(shard)
(should be of similar performance to your implementation)
Still, I think an API like that could be useful if implemented efficiently (see this discussion to understand the issue with select/__getitem__, which your implementation relies on), which can be done with pa.Table.to_reader in PyArrow 8.0.0+.
@lhoestq @albertvillanova wdyt? We could use such an API to efficiently iterate over the batches in map before processing them.
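For illustration, a rough sketch of what that could look like (assuming arrow_table is the dataset's underlying pyarrow.Table and batch_size is defined):
# to_reader (PyArrow 8.0.0+) yields RecordBatches of at most `max_chunksize` rows
# without copying the underlying Arrow data.
reader = arrow_table.to_reader(max_chunksize=batch_size)
for record_batch in reader:
    batch = record_batch.to_pydict()  # dict of lists, like a `batched` map input
    ...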
This is more efficient since it doesn't bring the data into memory:
import math

for i in range(math.ceil(len(dset) / batch_size)):  # include the last partial shard
    start = i * batch_size
    end = min((i + 1) * batch_size, len(dset))
    shard = dset.select(range(start, end))  # an indices mapping; no rows are loaded into memory
@marianna13 can you give more details on when it would be handy to have this shard generator?
Sure! I used such a generator when I needed to process a very large dataset (>1TB) in parallel. I found out empirically that it's much more efficient to process only one part of the dataset at a time with the shard generator. I tried to use map with batching, but it caused OOM errors, so I tried the normal shard method, and this is what I came up with. I thought it might be helpful to someone else!
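Roughly, the workflow looked like this (an illustrative sketch; process_chunk and save_results are placeholders for the actual processing and saving steps):
for i, chunk in enumerate(ds.shard_generator(100_000)):
    # process one manageable chunk at a time instead of mapping over the full dataset
    results = process_chunk(chunk)
    save_results(results, i)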
I see, thanks! map should work just fine even at this scale; feel free to open an issue if you'd like to discuss your OOM issue.
Regarding shard_generator: since it is pretty straightforward to get shards, I'm not sure we need that extra Dataset method.
Hi again! We've just added _iter_batches(batch_size) to the Dataset API for fast iteration over batches/chunks, so I think we can close this PR. Compared to this implementation, _iter_batches leverages pa.Table.to_reader for chunking, which makes it significantly faster.
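A quick usage sketch, based only on the signature mentioned above:
for batch in ds._iter_batches(batch_size=300):
    ...  # `batch` is one chunk of up to 300 rows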