Add FixedSizeList DType / FixedShapeTensorArray ExtDType
Thanks for this great open source project! 🙏
I know that tensors are not supported yet, but I wanted to open an issue to enquire about their status on the roadmap.
Example:
import numpy as np
import pyarrow as pa
import vortex as vx
data = np.array([[0, 1], [2, 3]])
ar = pa.FixedShapeTensorArray.from_numpy_ndarray(data)
var = vx.array(ar)
PanicException: Array encoding not implemented for Arrow data type
FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2)
On the one hand, we aim to provide zero-copy compatibility with all Arrow types, so in that sense we should indeed add a Tensor type (as with our missing decimal type, #2395!). This would allow us to store a column of tensors in a table or other columnar structure.
But in your specific case, I wonder if you're actually looking to store n-dimensional arrays? That is, the top-level construct is an ND-array, more similar to XArray / Zarr than to Pandas / Parquet? For this we're thinking about how we might build a chunked n-dimensional file format on top of Vortex arrays, one that exposes APIs supporting n-d slicing. It would probably ship with a slightly different default set of encodings (more tuned to scientific / tensor data), as well as the ability to reorder cells / chunks by arbitrary space-filling curves.
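To illustrate what reordering chunks by a space-filling curve could look like, here is a minimal pure-Python sketch of Z-order (Morton) ordering for 2-D chunk coordinates. This is not Vortex code: the function names `morton_code` and `z_order` are hypothetical, and Z-order is only one of many possible curves.

```python
def morton_code(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of (row, col) into a single Z-order key.

    Column bits go to even positions and row bits to odd positions.
    Coordinates are assumed to be non-negative and to fit in `bits` bits.
    """
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code


def z_order(chunks):
    """Return chunk coordinates sorted along the Z-order curve."""
    return sorted(chunks, key=lambda rc: morton_code(rc[0], rc[1]))


# Chunk grid of 4 x 4, listed in row-major order.
grid = [(r, c) for r in range(4) for c in range(4)]
ordered = z_order(grid)
print(ordered[:8])
```

Chunks that are close together in 2-D tend to get nearby keys, so a slice over a small region touches fewer distant parts of the file than with plain row-major ordering.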
Happy to grab time to chat if this is of interest to you?
Just exploring at the moment! I'm a compression nerd and got intrigued by your docs. I sent you a DM on LI to connect.
We now have a tracking issue for fixed-size lists, so we'll close this one and link it to that parent issue 🙂