vortex icon indicating copy to clipboard operation
vortex copied to clipboard

Tracking Issue: Fixed-Size List Arrays

Open connortsui20 opened this issue 4 months ago • 2 comments

Tracking Issue: Fixed-Size List Arrays

This is a tracking issue for adding fixed-size list arrays to Vortex.

The main difference between lists and fixed-size lists is that fixed-size lists do not need to store an offsets buffer, since we are able to use the predictable stride pattern to jump to data. This allows us to more efficiently store data with known size (like matrices or tensors) as well as compute more efficiently with SIMD.

Public API

pub enum DType {
    /// A logical fixed-size list type.
    ///
    /// This is parameterized by a `DType` that represents the element type of the inner lists, as
    /// well as a `u32` size that determines the fixed length of each `FixedSizeList` scalar.
    FixedSizeList(Arc<DType>, u32, Nullability),
}

#[cfg(not(target_arch = "wasm32"))]
const_assert_eq!(size_of::<DType>(), 16);

#[cfg(target_arch = "wasm32")]
const_assert_eq!(size_of::<DType>(), 12); // Increase size to 12 bytes because of the `u32` field.

Prior Art

  • Apache Arrow Fixed-Size List Overview: https://arrow.apache.org/docs/format/Intro.html#fixed-size-list
  • Apache Arrow Fixed-Size List Layout: https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout
  • Apache Arrow Variable-Size List Layout: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
  • Apache Arrow Fixed-shape Tensor: https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor
  • Arrow Rust FixedSizeListArray: https://docs.rs/arrow/latest/arrow/array/struct.FixedSizeListArray.html
  • pyarrow.FixedSizeListType: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListType.html
  • DuckDB Array Type (the same as fixed-size list): https://duckdb.org/docs/stable/sql/data_types/array.html
  • Adding a new variant (DType::Decimal) to Vortex: https://github.com/vortex-data/vortex/pull/3058
  • Conversion to numpy array potential optimizations: https://github.com/apache/arrow/issues/ 35622 (so we don't ping that issue)

Steps / History

  • [x] https://github.com/vortex-data/vortex/pull/4371
  • [x] https://github.com/vortex-data/vortex/pull/4380
  • [x] https://github.com/vortex-data/vortex/pull/4385
  • [X] https://github.com/vortex-data/vortex/pull/4405
  • [X] https://github.com/vortex-data/vortex/pull/4428
  • [x] https://github.com/vortex-data/vortex/pull/4495
  • [x] https://github.com/vortex-data/vortex/pull/4541
  • [x] https://github.com/vortex-data/vortex/pull/4545
  • [x] https://github.com/vortex-data/vortex/pull/4590
  • [x] https://github.com/vortex-data/vortex/pull/4592
  • [x] https://github.com/vortex-data/vortex/pull/4601
  • [x] https://github.com/vortex-data/vortex/pull/4610
  • [x] https://github.com/vortex-data/vortex/pull/4618

Unresolved Questions

  • [ ] Is it okay to increase the size of DType to 12 bytes on WASM (instead of the previous size of 8)? Or instead have an FixedListElementType that Arcs both the DType and the u32 size? If we do that then List should also be changed
  • [ ] Is u32 the correct type to use here (Arrow uses i32 for Java)

connortsui20 avatar Aug 26 '25 11:08 connortsui20

FixedSizedList doesn't actually increase DType size (which was already 16 and remains 16)

lwwmanning avatar Aug 27 '25 16:08 lwwmanning

It has increased it for wasm from 8 to 12. Though don’t think it matters

robert3005 avatar Aug 27 '25 16:08 robert3005