Add RunEndEncoded Array Support

Open vustef opened this issue 1 month ago • 0 comments

Implements support for Arrow's RunEndEncoded (REE) layout as specified in the Arrow format specification. REE is a run-length encoding variant that efficiently stores arrays with repeated values using two child arrays: run_ends (the logical index at which each run ends, i.e. the cumulative run lengths) and values (one entry per run, holding that run's value).
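To make the layout described above concrete, here is a minimal sketch in plain Python. It is not the PR's Julia code; the function names `ree_encode` and `ree_decode` are hypothetical, and the input is assumed to be a homogeneous list in which `None` stands for a null.

```python
def ree_encode(data):
    """Split `data` into (run_ends, values).

    run_ends[k] is the cumulative logical length at the end of run k,
    values[k] is the value repeated throughout run k.
    """
    run_ends, values = [], []
    for i, item in enumerate(data):
        if values and values[-1] == item:
            # Same value as the current run: extend that run.
            run_ends[-1] = i + 1
        else:
            # Value changed (or first element): start a new run.
            values.append(item)
            run_ends.append(i + 1)
    return run_ends, values


def ree_decode(run_ends, values):
    """Expand (run_ends, values) back into the full logical list."""
    out, start = [], 0
    for end, value in zip(run_ends, values):
        out.extend([value] * (end - start))
        start = end
    return out
```

For example, `ree_encode(["a", "a", "a", "b", None, None])` yields run ends `[3, 4, 6]` and values `["a", "b", None]`: six logical elements are stored as three runs, and the nulls live in the values child rather than in run_ends.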

Implementation

  • Core type: Added Arrow.RunEndEncoded{T,R,A} struct with O(log n) indexing via binary search over run_ends (n = number of runs)
  • Type system: Registered RunEndEncodedKind in ArrowTypes module
  • Serialization: Implemented arrowvector() and makenodesbuffers!() for writing REE arrays
  • Deserialization: Added build() function and juliaeltype() for reading REE arrays from Arrow IPC format
  • Interoperability: Validated against PyArrow-generated test files (included as fixtures)
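The binary search indexing mentioned for the core type can be sketched as follows. This is an illustration in Python with 0-based indices, not the PR's implementation; `ree_getindex` is a hypothetical name. With Julia's 1-based indices the analogous lookup is the first run whose end is greater than or equal to the requested index.

```python
from bisect import bisect_right


def ree_getindex(run_ends, values, i):
    """Return the logical element at 0-based index `i` in O(log n_runs).

    The run containing `i` is the first run whose end is strictly
    greater than `i`, which is exactly what bisect_right returns.
    """
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(f"index {i} out of range for length {length}")
    return values[bisect_right(run_ends, i)]
```

With run ends `[3, 4, 6]` and values `["a", "b", None]`, indices 0 to 2 resolve to `"a"`, index 3 to `"b"`, and indices 4 and 5 to `None`, without ever materializing the decoded array.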

Testing

  • Cross-language validation using PyArrow 20.0.0-generated test files
  • Round-trip tests for various data types (integers, floats, strings, booleans, with nulls)
  • Edge cases: single runs, alternating values, long runs
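The kind of round-trip check listed above can be sketched in a few lines of Python. This does not reproduce the PR's test suite: `encode`, `decode`, `CASES`, and `check_round_trips` are hypothetical names, and the sample data is made up to mirror the listed edge cases.

```python
from itertools import chain, groupby, repeat


def encode(data):
    """Run-end encode `data` into (run_ends, values)."""
    run_ends, values, end = [], [], 0
    for value, group in groupby(data):
        end += sum(1 for _ in group)  # length of this run
        run_ends.append(end)
        values.append(value)
    return run_ends, values


def decode(run_ends, values):
    """Expand (run_ends, values) back into the logical list."""
    starts = [0] + run_ends[:-1]
    return list(chain.from_iterable(
        repeat(v, e - s) for s, e, v in zip(starts, run_ends, values)))


# Illustrative inputs only; None stands for a null.
CASES = {
    "single run": [7] * 5,
    "alternating": [0, 1] * 4,
    "long runs": [1.5] * 1000 + [2.5] * 1000,
    "strings with nulls": ["a", "a", None, None, "b"],
    "booleans": [True, True, False],
}


def check_round_trips(cases=CASES):
    """Assert round-trip equality and basic run_ends invariants."""
    for name, data in cases.items():
        run_ends, values = encode(data)
        assert decode(run_ends, values) == data, name
        assert len(run_ends) == len(values), name
        # Run ends must be strictly increasing and end at the logical length.
        assert all(a < b for a, b in zip(run_ends, run_ends[1:])), name
        assert run_ends[-1] == len(data), name
    return len(cases)
```

Alternating values are the worst case for this layout (one run per element), while long runs compress to a handful of entries, which is why both are worth covering.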

Closes #476

vustef, Nov 17 '25 14:11