datasets
Make Image cast storage faster
PR for issue #6782.
Makes cast_storage of the Image class faster by removing the slow call to .pylist.
Instead directly convert each ListArray item to either Array2DExtensionType or Array3DExtensionType.
This also preserves the dtype, removing the warning if the array is already uint8.
Hi ! Thanks for diving into this, this conversion to python lists is indeed quite slow.
Array2DExtensionType and Array3DExtensionType currently rely on pyarrow lists, but we will soon modify them to use FixedShapeTensorArray instead which is more efficient (e.g. doesn't need to store an offset for each value). So ideally it would be cool to speed this code up without using those extension types or it will be blocking to improve Array2DExtensionType and Array3DExtensionType.
If I understand correctly you just need the logic from ArrayExtensionArray.to_numpy ? If so feel free to make a separate function and ArrayExtensionArray.to_numpy can call it
Hey! I didn't have time to look into this but I just stumbled upon another problem. While my fix kind of made it usable, I have now pre-embedded the images, and even as Array3D they are really slow to load. I don't think this can be resolved by using ArrayExtensionArray.to_numpy.
I think actually making the Array3DExtensionType faster would probably resolve both issues as you mentioned. Is there an update on using FixedShapeTensorArray? I'd gladly help implementing/testing it if there is some outline how to do it.
No one is working on this atm afaik (and actually we don't have any ETA unfortunately).
To do this change I think we need to:
- update the _ArrayXD parent class of all the Array2D, Array3D types to use the pa.fixed_shape_tensor type (note that pa.fixed_shape_tensor takes the value type first, then the shape):
    - pa_type = globals()[self.__class__.__name__ + "ExtensionType"](self.shape, self.dtype)
    + pa_type = pa.fixed_shape_tensor(string_to_arrow(self.dtype), self.shape)
- remove the old extension type _ArrayXDExtensionType and extension array ArrayExtensionArray
- probably update some functions in features.py that were using those types and use the new ones instead
Thanks, I have looked into this and have a working solution at least for my specific case. But I had quite a few issues along the way that are not solved nicely. It follows your suggestion though internally it is then just a fixed_shape_tensor as there is no ExtensionType anymore.
Hopefully, I can create a separate PR with these changes soon.
Nice, thanks @Modexus !
I have run into some issues, notably I don't think FixedShapeTensorArray is completely supported by pandas and polars.
Well it seems to work for pandas but one loses the actual shape of the extension.
Polars just throws an error and this cannot be changed with schema_overrides as they are applied after.
I have tried to somehow cast the FixedShapeTensorArray to something else like a nested FixedSizeLists, however I have not found a clean solution to do that.
If somebody has a clean solution to cast it to something such that the shape survives the roundtrip to pandas/polars and back, it may be possible.
Can we start using FixedShapeTensor or FixedSizeList even if pandas/polars don't support them fully yet ?
We would still get the benefit of optimized conversion to numpy