rerun icon indicating copy to clipboard operation
rerun copied to clipboard

Tracking issue: Migrate from `re_arrow2` to `arrow`

Open teh-cmc opened this issue 2 years ago • 4 comments

Blockers

  • Soft-blocked on https://github.com/apache/arrow-rs/issues/6360 (for lowering memory use)
  • Semi-blocked on https://github.com/apache/arrow-rs/issues/4472 (we use DataType::Extension for Tuid)

Multiple end-goals:

  • Use same arrow lib as the rest of the ecosystem, which is where all the bug & perf fixes are actually happening
  • Use inifinitely less space to store Arrow metadata (schema deduplication)
    • #1809
  • Make it possible to send raw Arrow data to Rerun and have it just work (RERUN:component_name)
    • #3360
    • Also frees up usage of Arrow extensions for actual native extensions (e.g. #3004)
  • Native integration with half for f16
  • Etc etc

TODO (split into sub-issues as needed):

  • [ ] Remove all direct uses of arrow2 (codegen, data{cell,row,table}, ArrowBuffer, etc)
    • Related: #2978
  • [x] Migrate serde-based components (i.e. blueprint stuff) to arrow1
    • https://docs.rs/arrow-json/47.0.0/arrow_json/reader/struct.Decoder.html#method.serialize might be all we need
  • [ ] Get rid of Arrow extensions everywhere, introduce RERUN:component_name (#3360)
  • [ ] Runtime schema registry / dedupe datatypes (#1809)
  • [ ] Remove DataCell::component_name
  • [ ] Replace TransportChunk with RecordBatch?

On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with polars (which is built on top of arrow2).

The solution to this is to make sure we only integrate with polars in one single place: the Data{Cell,Row,Table} layer (https://github.com/rerun-io/rerun/issues/1692). Once that's done, we can remove all ad-hoc polars code everywhere and just build a Data{Row,Cell,Table} anytime we want a polars::Series/polars::DataFrame (#1759).

Internally, the conversion from DataTable to polars::DataFrame will require a zero-copy tri-stage conversion from arrow1->arrow2->polars.


  • Supersedes https://github.com/rerun-io/rerun/issues/1805
  • Supersedes #2354

teh-cmc avatar Oct 09 '23 10:10 teh-cmc

re_arrow2 has an arrow feature, with glue for converting data between arrow and re_arrow2: https://docs.rs/re_arrow2/0.17.4/re_arrow2/array/trait.Arrow2Arrow.html

Using that we can start this migration piece-wise. It would have double the dependencies for a transitionary period, leading to longer compilation times and bigger .wasm binary, but I think that is an ok tradeoff.

Potential roadmap:

  • [x] Verify that Arrow2Arrow is zero-copy
    • https://github.com/rerun-io/re_arrow2/pull/6
  • https://github.com/rerun-io/rerun/issues/6819
  • [ ] Move SizeBytes to own crate, with separate arrow and arrow2 feature flags
  • [x] Rename to_arrow/from_arrow/… to to_arrow2/from_arrow2/…
  • [x] Add poly-filled to_arrow/from_arrow using the glue
  • [ ] Migrate codegenned serialization

After de-chunkfification:

  • [ ] Migrate codegenned deserialization
  • [ ] Migrate everything else

As of 2024-07-08, there are only around 300 lines of Rust referencing the string arrow2 directly, when one ignores generated code.

ignored paths crates/re_types/**, crates/re_types_core/src/archetypes/**, crates/re_types_core/src/datatypes/**, crates/re_types_core/src/components/**, crates/re_types_blueprint/src/blueprint/components/**, crates/re_types_blueprint/src/blueprint/archetypes/**

emilk avatar Jul 08 '24 12:07 emilk

  • I believe https://github.com/rerun-io/rerun/issues/6807 also requires bringing in a dependency on arrow

jleibs avatar Jul 10 '24 17:07 jleibs

Blocked on:

  • https://github.com/apache/arrow-rs/pull/6300

teh-cmc avatar Aug 31 '24 18:08 teh-cmc

New blocker:

  • https://github.com/apache/arrow-rs/issues/6360

teh-cmc avatar Sep 05 '24 13:09 teh-cmc

This is almost done now. What remains is:

  • Porting re_types_builder (the generated code is all arrow-rs, but the builder itself uses arrow2::Datatype)
  • IPC serialization (we're hitting a bug in arrow-rs). Look for SERIALIZE_WITH_ARROW_1 in the code for details
  • FFI communication in rerun_c

emilk avatar Jan 21 '25 08:01 emilk

I have added https://github.com/apache/arrow-rs/issues/7315 which I think is the root cause of the serialization error.

timsaucer avatar Mar 20 '25 18:03 timsaucer

What's left now is mainly porting re_types_builder, which should be pretty straight-forward

emilk avatar Mar 27 '25 14:03 emilk