
Add FixedSizeList DType / FixedShapeTensorArray ExtDType

Open rabernat opened this issue 8 months ago • 2 comments

Thanks for this great open source project! 🙏

I know that tensors are not supported yet, but I wanted to open an issue to enquire about their status on the roadmap.

Example:

import numpy as np
import pyarrow as pa
import vortex as vx

data = np.array([[0, 1], [2, 3]])
ar = pa.FixedShapeTensorArray.from_numpy_ndarray(data)
var = vx.array(ar)
PanicException: Array encoding not implemented for Arrow data type
FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2)
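
For context on the error above: the tensor array's storage type is a FixedSizeList of Int64 with list size 2. In Arrow's fixed-size list layout, all elements live in one contiguous child values array, and element i occupies slots i*n through (i+1)*n - 1. The following is a minimal pure-Python sketch of that layout, ignoring validity bitmaps; the helper names `to_fixed_size_list` and `get_row` are made up for illustration and are not part of pyarrow or vortex.

```python
def to_fixed_size_list(rows):
    """Flatten equal-length rows into (values, list_size).

    Hypothetical helper: mimics the fixed-size list layout with a single
    flat values buffer plus a constant list size. Nulls are not modelled.
    """
    if not rows:
        raise ValueError("need at least one row to infer the list size")
    list_size = len(rows[0])
    if any(len(row) != list_size for row in rows):
        raise ValueError("all rows must have the same length")
    values = [v for row in rows for v in row]
    return values, list_size


def get_row(values, list_size, i):
    """Return element i by slicing the flat buffer; no offsets are needed."""
    start = i * list_size
    return values[start:start + list_size]


# Same data as the example in this issue: two rows of two integers.
values, list_size = to_fixed_size_list([[0, 1], [2, 3]])
second_row = get_row(values, list_size, 1)
```

Because the list size is constant, no offsets buffer is required, which is what distinguishes this layout from a variable-length list.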

rabernat avatar Apr 10 '25 15:04 rabernat

So on the one hand, we hope to provide zero-copy compatibility with all Arrow types, and so in that sense we should indeed add a Tensor type (similar to our missing decimal type #2395!). This would allow us to store a column of tensors in a table or other columnar structure.

But in your specific case, I wonder if you're actually looking to store n-dimensional arrays? That is, the top-level construct is an ND-array, closer to Xarray / Zarr than to Pandas / Parquet? For this we're thinking about how we might build a chunked n-dimensional file format on top of Vortex arrays that exposes APIs supporting n-dimensional slicing. It would probably ship with a slightly different default set of encodings (more tuned to scientific / tensor data), as well as the ability to reorder cells / chunks along arbitrary space-filling curves.
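
To make the space-filling-curve idea concrete, here is a small sketch that orders 2-D chunk coordinates along a Z-order (Morton) curve, one common choice of such a curve. It is only an illustration under assumptions: nothing here is a Vortex API, and the names `morton_code` and `z_order` are hypothetical.

```python
def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return code


def z_order(chunks):
    """Sort (x, y) chunk coordinates by their Morton code."""
    return sorted(chunks, key=lambda c: morton_code(c[0], c[1]))


# A 2x2 grid of chunk coordinates, initially in row-major order.
grid = [(x, y) for y in range(2) for x in range(2)]
ordered = z_order(grid)
```

Chunks that are close in both dimensions tend to get nearby keys, so a rectangular slice touches fewer distant regions of the file than it would under plain row-major ordering.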

Happy to grab time to chat if this is of interest to you?

gatesn avatar Apr 11 '25 08:04 gatesn

Just exploring at the moment! I'm a compression nerd and got intrigued by your docs. I sent you a DM on LI to connect.

rabernat avatar Apr 11 '25 21:04 rabernat

We now have a tracking issue for fixed-size lists, so I will close this one and link it to the parent issue 🙂

blaginin avatar Aug 27 '25 22:08 blaginin