santa
Add parquet output using parquet2 via Rust
This patch series adds support for building the parquet2 Rust crate and using it, from C++, to write a parquet file.
This is ready for review. What's been done so far:
- Rust build support
- FFI interface and linkage with C++ code using cxx crate
- Parquet Table implementation on top of the parquet2 crate
- C++ bindings to generate a Parquet table and write it to a file
- Unit tests in C++ and Rust and an e2e test using Pandas (via Docker)
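For reviewers who want a rough mental model of the Rust/C++ boundary before reading the diff, here is a minimal sketch. It is not the code in this PR: the PR uses the cxx crate, while this sketch swaps in plain `extern "C"` functions from the standard library so it stays self-contained. `ColumnBuffer` and the `column_buffer_*` functions are hypothetical names, and the output is raw little-endian bytes, not Parquet.

```rust
use std::io::{self, Write};

/// Hypothetical single-column buffer of u64 values (not a type from this PR).
pub struct ColumnBuffer {
    values: Vec<u64>,
    flushed_rows: usize,
}

impl ColumnBuffer {
    pub fn new() -> Self {
        ColumnBuffer { values: Vec::new(), flushed_rows: 0 }
    }

    /// Appends one value to the in-memory buffer.
    pub fn push(&mut self, v: u64) {
        self.values.push(v);
    }

    /// Number of values buffered but not yet flushed.
    pub fn buffered(&self) -> usize {
        self.values.len()
    }

    /// Total number of values flushed so far.
    pub fn flushed_rows(&self) -> usize {
        self.flushed_rows
    }

    /// Writes the buffered values as little-endian bytes, then clears the
    /// buffer. `Vec::clear` keeps the existing allocation.
    pub fn flush<W: Write>(&mut self, out: &mut W) -> io::Result<usize> {
        for v in &self.values {
            out.write_all(&v.to_le_bytes())?;
        }
        let n = self.values.len();
        self.flushed_rows += n;
        self.values.clear();
        Ok(n)
    }
}

// C-ABI surface. A real export would also need a stable symbol name
// (for example via the `no_mangle` attribute); that is omitted here.

/// Allocates a buffer on the heap and hands ownership to the caller.
pub extern "C" fn column_buffer_new() -> *mut ColumnBuffer {
    Box::into_raw(Box::new(ColumnBuffer::new()))
}

/// Appends a value. A null pointer is ignored.
pub unsafe extern "C" fn column_buffer_push(buf: *mut ColumnBuffer, v: u64) {
    if let Some(b) = unsafe { buf.as_mut() } {
        b.push(v);
    }
}

/// Frees a buffer previously returned by `column_buffer_new`.
pub unsafe extern "C" fn column_buffer_free(buf: *mut ColumnBuffer) {
    if !buf.is_null() {
        drop(unsafe { Box::from_raw(buf) });
    }
}
```

The cxx crate generates this kind of glue from a bridge module and adds type checking on both sides, which is why the PR uses it instead of hand-written `extern "C"` functions.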
The code size when compiled is about 3.5 MiB, which can probably be reduced further with future work. (Most of the .text size is compression libraries, and brotli is one of the larger ones.) This compares favorably to 10-30 MiB for C++ Arrow (depending on how you build it and how unmaintainable you're willing to make things for the sake of shrinking the build).
The build time for the whole thing is about 5 seconds. Again, this compares favorably to Arrow (in Pedro, that build takes about 30-60 seconds).
As I mentioned to Pete, this PR is quite large, but I think anything smaller wouldn't be reviewable, because it couldn't work end-to-end. I'm happy to jump on a video call and walk people through it. There is also a fairly detailed commit history, if you'd like to step through it. (Though a word of warning: the first draft of the Rust code got rewritten.)
Future work:
- Benchmarking
- Get rid of the buffer reinitialization after flush (will probably require rewriting FileWriter to an imperative-style API, which can wait)
- Wrap in a C++ class for use by Santa's Logger
Sorry about the timeline saying I added 19 commits. I just rebased the branch onto main and GH got confused. (I also added a few more comments.)
Still looking into internal build stuff with this.
Apologies for the long delay here. There were some internal things that needed to happen before we could resume looking at this.
@the80srobot Can you make this a static library? That would simplify the internal builds dramatically.
Moving back to draft for now. This PR has become a bit stale and will require some effort to get back to being merge-ready.