fast array extraction
Implements #7210 using the method suggested in https://github.com/huggingface/datasets/pull/7207#issuecomment-2411789307
```python
import numpy as np
from datasets import Dataset, Features, Array3D

features = Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)
```
~0.02 s vs 0.9s on main
```python
import time

ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
```
< 0.01 s vs 1.3 s on main
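For anyone reproducing these timings, a small helper along the following lines may be convenient. This is only a sketch: `time_iteration` is a hypothetical name introduced here for illustration (it is not part of `datasets`), and it uses `time.perf_counter` rather than `time.time` because the former is intended for measuring short intervals.

```python
import time
from typing import Iterable, Tuple


def time_iteration(iterable: Iterable) -> Tuple[int, float]:
    """Iterate once over `iterable`; return (number of items, elapsed seconds).

    Hypothetical helper for illustration; not part of the `datasets` library.
    """
    n = 0
    t0 = time.perf_counter()
    for _ in iterable:
        n += 1
    elapsed = time.perf_counter() - t0
    return n, elapsed


if __name__ == "__main__":
    # Any iterable works here, e.g. the result of dataset.to_iterable_dataset().
    n, elapsed = time_iteration(range(1000))
    print(f"{n} items in {elapsed:.4f} s")
```

Returning the item count alongside the elapsed time makes it easy to confirm that both branches iterated over the same number of examples before comparing numbers.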
@lhoestq I can see this breaks a number of array-related tests, but I can update the test cases if you support making this change.
I also added an Array1D feature which will always be decoded into a numpy array and likewise improves extraction performance:
```python
import numpy as np
from datasets import Dataset, Features, Array1D, Sequence, Value

array_features = Features(**{"array0": Array1D((None,), dtype="float32"), "array1": Array1D((None,), dtype="float32")})
sequence_features = Features(**{"array0": Sequence(feature=Value("float32"), length=-1), "array1": Sequence(feature=Value("float32"), length=-1)})
array_dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [20000, 10000] * 25] for i in range(2)}, features=array_features)
sequence_dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [20000, 10000] * 25] for i in range(2)}, features=sequence_features)
```
```python
import time

t0 = time.time()
for ex in array_dataset.to_iterable_dataset():
    pass
t1 = time.time()
# < 0.01 s

t0 = time.time()
for ex in sequence_dataset.to_iterable_dataset():
    pass
t1 = time.time()
# ~1.1 s
```
I also added support for extracting structs of arrays as dicts of numpy arrays:
```python
import time

import numpy as np
from datasets import Dataset, Features, Array3D, Sequence

features = Features(struct={"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")}, _list=Sequence(feature=Array3D((None, 10, 10), dtype="float32")))
dataset = Dataset.from_dict({"struct": [{f"array{i}": np.zeros((x, 10, 10), dtype=np.float32) for i in range(2)} for x in [2000, 1000] * 25], "_list": [[np.zeros((x, 10, 10), dtype=np.float32) for i in range(2)] for x in [2000, 1000] * 25]}, features=features)
t0 = time.time()
for ex in dataset.to_iterable_dataset():
    pass
t1 = time.time()
# Checks the last example yielded by the loop
assert isinstance(ex["struct"]["array0"], np.ndarray) and ex["struct"]["array0"].ndim == 3
```
~0.02 s and no exception vs ~7s with an exception on main
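The assertion above only inspects one field. To check a whole nested example at once, a recursive helper like the sketch below could be used. `all_leaves_are_arrays` is a hypothetical name introduced here, not something in `datasets`, and it assumes that containers are plain `dict`, `list`, or `tuple` objects. Note that an empty container counts as passing.

```python
import numpy as np


def all_leaves_are_arrays(value) -> bool:
    """Return True if every leaf of a nested dict/list/tuple is a np.ndarray.

    Hypothetical helper for illustration; empty containers return True.
    """
    if isinstance(value, np.ndarray):
        return True
    if isinstance(value, dict):
        return all(all_leaves_are_arrays(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return all(all_leaves_are_arrays(v) for v in value)
    # Any other leaf (e.g. a Python float) means the value was not decoded to an array.
    return False


if __name__ == "__main__":
    # Hand-built stand-in for an extracted example with a struct of arrays.
    example = {"struct": {"array0": np.zeros((2, 10, 10), dtype=np.float32)}}
    print(all_leaves_are_arrays(example))
```

Applied to `ex` from the loop above, this would flag any field that still comes back as nested Python lists instead of arrays.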
I've updated the most straightforward failing test cases - lmk if you agree with those.
Might need some help / pointers on the remaining new failing tests, which seem a little bit more subtle.
@lhoestq I've had a go at fixing a few more test cases, but I'm getting quite uncertain about the remaining ones (as well as about some of the array-writing ones that I tried to fix in my last commit). There are still 27 failures vs 21 on main. In some cases I'm not completely sure what the intended behaviour is, and my understanding of the flow for typed writing is a bit vague.
@lhoestq do you have any thoughts on this? I wasn't able to resolve all the test issues but the basic functionality seemed useful?