datasets fast array extraction

Implements #7210 using the method suggested in https://github.com/huggingface/datasets/pull/7207#issuecomment-2411789307

```python
import numpy as np
from datasets import Dataset, Features, Array3D
features = Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)
```

~0.02 s vs 0.9 s on main

```python
import time

# Time one full pass over the iterable version of the dataset
ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
```

< 0.01 s vs 1.3 s on main
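
For anyone reproducing these numbers, the timing loop can be wrapped in a small helper. This is only a sketch: `time_full_pass` is a hypothetical name, not part of `datasets`, and it uses `time.perf_counter`, which is usually a better clock than `time.time` for short intervals. A `range` stands in for the dataset so the snippet runs on its own.

```python
import time
from typing import Iterable, Tuple


def time_full_pass(iterable: Iterable) -> Tuple[int, float]:
    """Iterate once over `iterable`; return (number of items, elapsed seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    count = 0
    for _ in iterable:
        count += 1
    return count, time.perf_counter() - start


# Stand-in data so the sketch runs without `datasets` installed;
# in the benchmark above this would be `dataset.to_iterable_dataset()`.
n_examples, seconds = time_full_pass(range(50))
print(f"{n_examples} examples in {seconds:.4f} s")
```

Returning the item count alongside the elapsed time makes it easy to confirm that both branches iterated over the same number of examples.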

@lhoestq I can see this breaks a number of array-related tests, but I can update the test cases if you would support making this change.

I also added an Array1D feature which will always be decoded into a numpy array and likewise improves extraction performance:

```python
import time
import numpy as np
from datasets import Dataset, Features, Array1D, Sequence, Value
array_features = Features(**{"array0": Array1D((None,), dtype="float32"), "array1": Array1D((None,), dtype="float32")})
sequence_features = Features(**{"array0": Sequence(feature=Value("float32"), length=-1), "array1": Sequence(feature=Value("float32"), length=-1)})
array_dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [20000, 10000] * 25] for i in range(2)}, features=array_features)
sequence_dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [20000, 10000] * 25] for i in range(2)}, features=sequence_features)
```


```python
# Array1D features
t0 = time.time()
for ex in array_dataset.to_iterable_dataset():
    pass
t1 = time.time()
```

< 0.01 s

```python
# Sequence features
t0 = time.time()
for ex in sequence_dataset.to_iterable_dataset():
    pass
t1 = time.time()
```

~1.1 s
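
One plausible reason for the gap between the two loops above, offered as an assumption rather than a description of this PR's code: Arrow stores variable-length lists as a flat values buffer plus offsets, so a row can be exposed as a view onto that buffer instead of being rebuilt element by element as Python objects. The sketch below illustrates the difference using only the standard library, with `array` and `memoryview` standing in for the Arrow buffer and NumPy views. `rows_as_lists` and `rows_as_views` are hypothetical names.

```python
from array import array

# Flat float32 buffer holding three variable-length rows back to back,
# plus offsets marking where each row starts and ends.
values = array("f", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
offsets = [0, 2, 3, 6]


def rows_as_lists(values, offsets):
    # Per-element conversion: every float becomes a separate Python object.
    return [values[start:stop].tolist() for start, stop in zip(offsets, offsets[1:])]


def rows_as_views(values, offsets):
    # Zero-copy: each row is a window onto the same underlying buffer.
    buf = memoryview(values)
    return [buf[start:stop] for start, stop in zip(offsets, offsets[1:])]


lists = rows_as_lists(values, offsets)
views = rows_as_views(values, offsets)
print(lists)
print([len(view) for view in views])
```

The list version does work proportional to the number of elements, while the view version does work proportional to the number of rows, which is the kind of difference that matters for rows with 10,000 to 20,000 floats each.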

I also added support for extracting structs of arrays as dicts of numpy arrays:

```python
import time
import numpy as np
from datasets import Dataset, Features, Array3D, Sequence
features = Features(struct={"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")}, _list=Sequence(feature=Array3D((None, 10, 10), dtype="float32")))
dataset = Dataset.from_dict({"struct": [{f"array{i}": np.zeros((x, 10, 10), dtype=np.float32) for i in range(2)} for x in [2000, 1000] * 25], "_list": [[np.zeros((x, 10, 10), dtype=np.float32) for i in range(2)] for x in [2000, 1000] * 25]}, features=features)

t0 = time.time()
for ex in dataset.to_iterable_dataset():
    pass
t1 = time.time()
# Check the last example: struct fields should decode to 3D numpy arrays
assert isinstance(ex["struct"]["array0"], np.ndarray) and ex["struct"]["array0"].ndim == 3
```

~0.02 s and no exception vs ~7 s with an exception on main
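
To compare what each field of an example decodes to on different branches, a small recursive type summary can help. This is a sketch with a hypothetical helper name (`describe_types`), and the example is built from plain lists so that it runs without `numpy` or `datasets`. In practice it would be called on `ex` from the loop above.

```python
from typing import Any


def describe_types(obj: Any) -> Any:
    """Return a nested summary of type names, recursing into dicts and lists."""
    if isinstance(obj, dict):
        return {key: describe_types(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Summarise a sequence by its length and its first element only
        inner = describe_types(obj[0]) if obj else None
        return {"type": type(obj).__name__, "len": len(obj), "first": inner}
    # Anything else (including array objects) is reported by its type name
    return type(obj).__name__


# Stand-in example made of nested lists (an assumption for illustration only)
example = {"struct": {"array0": [[0.0]], "array1": [[0.0]]}, "_list": [[0.0], [0.0]]}
summary = describe_types(example)
print(summary)
```

A field that decodes to a numpy array would show up as the string `"ndarray"`, whereas a field that decodes to nested Python lists shows up as a nested `list` summary.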

Oct 14 '24 20:10 alex-hh

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Oct 15 '24 14:10 HuggingFaceDocBuilderDev

I've updated the most straightforward failing test cases. Let me know if you agree with those.

I might need some help or pointers on the remaining newly failing tests, which seem a bit more subtle.

Oct 15 '24 16:10 alex-hh

@lhoestq I've had a go at fixing a few more test cases, but I'm getting quite uncertain about the remaining ones (as well as about some of the array-writing ones that I tried to fix in my last commit). There are still 27 failures vs 21 on main. In some cases I'm not completely sure what the intended behaviour is, and my understanding of the flow for typed writing is a bit vague.

Oct 18 '24 11:10 alex-hh

@lhoestq do you have any thoughts on this? I wasn't able to resolve all the test issues but the basic functionality seemed useful?

Jan 28 '25 09:01 alex-hh
