Tracking issue: Migrate from `re_arrow2` to `arrow`
Blockers
- Soft-blocked on https://github.com/apache/arrow-rs/issues/6360 (for lowering memory use)
- Semi-blocked on https://github.com/apache/arrow-rs/issues/4472 (we use `DataType::Extension` for Tuid)
Multiple end-goals:
- Use same arrow lib as the rest of the ecosystem, which is where all the bug & perf fixes are actually happening
- Use infinitely less space to store Arrow metadata (schema deduplication)
  - #1809
- Make it possible to send raw Arrow data to Rerun and have it just work (`RERUN:component_name`)
  - #3360
- Also frees up usage of Arrow extensions for actual native extensions (e.g. #3004)
- Native integration with `half` for `f16`
- Etc etc
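
To illustrate the `RERUN:component_name` goal, here is a minimal sketch of tagging a column through plain key/value metadata instead of an Arrow extension type. It assumes the name is carried as a field-level metadata entry under that key. `Field`, `tagged_field` and `component_name` are hypothetical stand-ins written against `std` only, not the arrow-rs API, and the component name used in the usage note is just an illustrative value.

```rust
use std::collections::HashMap;

/// Metadata key named in this issue.
pub const COMPONENT_NAME_KEY: &str = "RERUN:component_name";

/// Hypothetical stand-in for an Arrow field: a column name plus string metadata.
/// (Assumption: the real field type exposes a similar string-to-string map.)
pub struct Field {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

/// Builds a field tagged with a component name (hypothetical helper).
pub fn tagged_field(name: &str, component: &str) -> Field {
    let mut metadata = HashMap::new();
    metadata.insert(COMPONENT_NAME_KEY.to_owned(), component.to_owned());
    Field {
        name: name.to_owned(),
        metadata,
    }
}

/// Returns the component name tagged on `field`, or `None` if the field is untagged.
pub fn component_name(field: &Field) -> Option<&str> {
    field.metadata.get(COMPONENT_NAME_KEY).map(String::as_str)
}
```

With this shape, `component_name(&tagged_field("positions", "rerun.components.Position3D"))` yields the tagged name, while a field with empty metadata yields `None`, so untagged raw Arrow data can still be told apart.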
TODO (split into sub-issues as needed):
- [ ] Remove all direct uses of `arrow2` (codegen, data{cell,row,table}, `ArrowBuffer`, etc)
  - Related: #2978
- [x] Migrate serde-based components (i.e. blueprint stuff) to `arrow1`
  - https://docs.rs/arrow-json/47.0.0/arrow_json/reader/struct.Decoder.html#method.serialize might be all we need
- [ ] Get rid of Arrow extensions everywhere, introduce `RERUN:component_name` (#3360)
- [ ] Runtime schema registry / dedupe datatypes (#1809)
- [ ] Remove `DataCell::component_name`
- [ ] Replace `TransportChunk` with `RecordBatch`?
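
The "runtime schema registry / dedupe datatypes" item could look roughly like the sketch below: each distinct datatype is stored once and every user holds a cheap shared handle. This is only an illustration of the interning idea under stated assumptions. `Datatype` and `DatatypeRegistry` are hypothetical names, and the enum is a tiny stand-in for a real Arrow datatype, not the arrow-rs type.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow datatype (not `arrow::datatypes::DataType`).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum Datatype {
    Float32,
    Utf8,
    List(Box<Datatype>),
    Struct(Vec<(String, Datatype)>),
}

/// Hypothetical registry that hands out one shared `Arc` per distinct datatype.
#[derive(Default)]
pub struct DatatypeRegistry {
    interned: HashMap<Datatype, Arc<Datatype>>,
}

impl DatatypeRegistry {
    /// Returns the shared handle for `datatype`, storing it on first use.
    pub fn intern(&mut self, datatype: Datatype) -> Arc<Datatype> {
        if let Some(existing) = self.interned.get(&datatype) {
            // Already registered: hand out another reference to the same allocation.
            return Arc::clone(existing);
        }
        let shared = Arc::new(datatype.clone());
        self.interned.insert(datatype, Arc::clone(&shared));
        shared
    }

    /// Number of distinct datatypes registered so far.
    pub fn distinct_count(&self) -> usize {
        self.interned.len()
    }
}
```

Interning the same datatype twice returns handles for which `Arc::ptr_eq` holds, so many rows or cells with an identical schema share one allocation rather than each carrying its own copy.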
On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with polars (which is built on top of arrow2).
The solution to this is to make sure we only integrate with polars in one single place: the Data{Cell,Row,Table} layer (https://github.com/rerun-io/rerun/issues/1692).
Once that's done, we can remove all ad-hoc polars code everywhere and just build a Data{Row,Cell,Table} anytime we want a polars::Series/polars::DataFrame (#1759).
Internally, the conversion from DataTable to polars::DataFrame will require a zero-copy tri-stage conversion from arrow1->arrow2->polars.
- Supersedes https://github.com/rerun-io/rerun/issues/1805
- Supersedes #2354
re_arrow2 has an arrow feature, with glue for converting data between arrow and re_arrow2: https://docs.rs/re_arrow2/0.17.4/re_arrow2/array/trait.Arrow2Arrow.html
Using that, we can do this migration piece by piece. We would carry both sets of dependencies for a transitional period, leading to longer compile times and a bigger .wasm binary, but I think that is an OK tradeoff.
Potential roadmap:
- [x] Verify that `Arrow2Arrow` is zero-copy
  - https://github.com/rerun-io/re_arrow2/pull/6
  - https://github.com/rerun-io/rerun/issues/6819
- [ ] Move `SizeBytes` to own crate, with separate `arrow` and `arrow2` feature flags
- [x] Rename `to_arrow/from_arrow/…` to `to_arrow2/from_arrow2/…`
- [x] Add poly-filled `to_arrow/from_arrow` using the glue
- [ ] Migrate codegenned serialization
After de-chunkification:
- [ ] Migrate codegenned deserialization
- [ ] Migrate everything else
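
The rename-then-poly-fill steps in the roadmap can be sketched as follows: the existing methods keep their bodies under the `*_arrow2` names, and the new `to_arrow`/`from_arrow` are default methods that just route through the conversion glue. Everything here is a simplified model under stated assumptions: `Arrow2Array`, `ArrowArray`, the `From` impls (standing in for the `Arrow2Arrow` glue), the `ToFromArrow` trait and `Position3D` are hypothetical and use only `std`.

```rust
/// Hypothetical stand-in for an `re_arrow2` array.
#[derive(Debug, Clone, PartialEq)]
pub struct Arrow2Array(pub Vec<f32>);

/// Hypothetical stand-in for an `arrow` array.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrowArray(pub Vec<f32>);

// Stand-in for the conversion glue: the buffer is moved, not copied.
impl From<Arrow2Array> for ArrowArray {
    fn from(array: Arrow2Array) -> Self {
        ArrowArray(array.0)
    }
}

impl From<ArrowArray> for Arrow2Array {
    fn from(array: ArrowArray) -> Self {
        Arrow2Array(array.0)
    }
}

/// Hypothetical trait modelling the renamed and poly-filled methods.
pub trait ToFromArrow: Sized {
    /// Existing implementation, renamed from `to_arrow`.
    fn to_arrow2(&self) -> Arrow2Array;
    /// Existing implementation, renamed from `from_arrow`.
    fn from_arrow2(array: &Arrow2Array) -> Option<Self>;

    /// Poly-fill: serialize the old way, then convert through the glue.
    fn to_arrow(&self) -> ArrowArray {
        self.to_arrow2().into()
    }

    /// Poly-fill: convert through the glue, then deserialize the old way.
    fn from_arrow(array: ArrowArray) -> Option<Self> {
        Self::from_arrow2(&array.into())
    }
}

/// Hypothetical component used to exercise the trait.
pub struct Position3D(pub [f32; 3]);

impl ToFromArrow for Position3D {
    fn to_arrow2(&self) -> Arrow2Array {
        Arrow2Array(self.0.to_vec())
    }

    fn from_arrow2(array: &Arrow2Array) -> Option<Self> {
        // Only accept arrays holding exactly three values.
        match array.0.as_slice() {
            [x, y, z] => Some(Position3D([*x, *y, *z])),
            _ => None,
        }
    }
}
```

Callers can switch to `to_arrow`/`from_arrow` right away. Once a type's serialization is migrated, it can override the default methods with a native implementation without touching call sites.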
As of 2024-07-08, there are only around 300 lines of Rust referencing the string arrow2 directly, when one ignores generated code.
ignored paths
crates/re_types/**, crates/re_types_core/src/archetypes/**, crates/re_types_core/src/datatypes/**, crates/re_types_core/src/components/**, crates/re_types_blueprint/src/blueprint/components/**, crates/re_types_blueprint/src/blueprint/archetypes/**
- I believe https://github.com/rerun-io/rerun/issues/6807 also requires bringing in a dependency on `arrow`
Blocked on:
- https://github.com/apache/arrow-rs/pull/6300
New blocker:
- https://github.com/apache/arrow-rs/issues/6360
This is almost done now. What remains is:
- Porting `re_types_builder` (the generated code is all arrow-rs, but the builder itself uses `arrow2::Datatype`)
- IPC serialization (we're hitting a bug in arrow-rs). Look for `SERIALIZE_WITH_ARROW_1` in the code for details
- FFI communication in `rerun_c`
I have added https://github.com/apache/arrow-rs/issues/7315 which I think is the root cause of the serialization error.
What's left now is mainly porting `re_types_builder`, which should be pretty straightforward.