Add parquet output using parquet2 via Rust

Open the80srobot opened this issue 1 year ago • 6 comments

This patch series adds support for building the parquet2 Rust crate and using it, from C++, to write a parquet file.

the80srobot avatar Nov 24 '23 17:11 the80srobot

This is ready for review. What's been done so far:

  1. Rust build support
  2. FFI and linkage with C++ code using the cxx crate
  3. Parquet Table implementation on top of the parquet2 crate
  4. C++ bindings to generate a Parquet table and write it to a file
  5. Unit tests in C++ and Rust and an e2e test using Pandas (via Docker)
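
For readers who want a feel for the shape of items 2-4, here is a minimal sketch of a Rust-owned table exposed to C++ through an opaque pointer. It is not the code in this PR: the PR uses the cxx crate and parquet2, while this sketch uses a plain `extern "C"` surface and only the standard library so that it stays self-contained. The names `Table`, `table_new`, `table_push_row`, `table_flush` and `table_free`, and the two example columns, are hypothetical.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical column-oriented buffer: one Vec per column.
#[derive(Default)]
pub struct Table {
    pids: Vec<i64>,
    paths: Vec<String>,
}

impl Table {
    /// Append one row by pushing a value onto every column.
    pub fn push_row(&mut self, pid: i64, path: &str) {
        self.pids.push(pid);
        self.paths.push(path.to_owned());
    }

    /// Stand-in for writing a row group to a file. Returns the number of
    /// rows that were buffered and empties the columns. `clear()` keeps
    /// the allocated capacity of each Vec.
    pub fn flush(&mut self) -> usize {
        let rows = self.pids.len();
        self.pids.clear();
        self.paths.clear();
        rows
    }
}

/// Allocate a table on the Rust heap and hand ownership to the caller.
pub extern "C" fn table_new() -> *mut Table {
    Box::into_raw(Box::new(Table::default()))
}

/// Append a row. Returns false on a null pointer or non-UTF-8 path.
/// Safety: `table` must come from `table_new` and `path` must point to
/// a NUL-terminated string.
pub unsafe extern "C" fn table_push_row(
    table: *mut Table,
    pid: i64,
    path: *const c_char,
) -> bool {
    if table.is_null() || path.is_null() {
        return false;
    }
    let c_path = unsafe { CStr::from_ptr(path) };
    let path = match c_path.to_str() {
        Ok(s) => s,
        Err(_) => return false,
    };
    unsafe { (*table).push_row(pid, path) };
    true
}

/// Flush buffered rows. Returns 0 for a null pointer.
/// Safety: `table` must come from `table_new` and not be freed yet.
pub unsafe extern "C" fn table_flush(table: *mut Table) -> usize {
    if table.is_null() {
        return 0;
    }
    unsafe { (*table).flush() }
}

/// Take ownership back from the caller and drop the table.
/// Safety: `table` must come from `table_new` and be freed only once.
pub unsafe extern "C" fn table_free(table: *mut Table) {
    if !table.is_null() {
        drop(unsafe { Box::from_raw(table) });
    }
}
```

To actually link this from C++, the functions would also need unmangled symbol names (the `no_mangle` attribute, whose exact spelling depends on the Rust edition) and matching prototypes on the C++ side. The cxx crate generates both halves of that glue from a single bridge declaration, which is one reason to prefer it over a hand-written surface like this one.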

The code size when compiled is about 3.5 MiB, which can probably be reduced further with future work. (Most of the .text size is compression libraries, and brotli is one of the larger ones.) This compares favorably to 10-30 MiB for C++ Arrow (depending on how you build it and how unmaintainable you're willing to make things for the sake of shrinking the build).

The build time for the whole thing is about 5 seconds. Again, this compares favorably to Arrow (in Pedro, that build takes about 30-60 seconds).

As I mentioned to Pete, this PR is quite large, but I think anything smaller wouldn't be reviewable, because it couldn't work end-to-end. I'm happy to jump on a video call and walk people through it. There is also a fairly detailed commit history, if you'd like to step through it. (A word of warning, though: the first draft of the Rust code got rewritten.)

Future work:

  • Benchmarking
  • Get rid of the buffer reinitialization after flush (will probably need to rewrite FileWriter as an imperative-style API, which can wait)
  • Wrap in a C++ class for use in Santa's Logger

the80srobot avatar Dec 14 '23 23:12 the80srobot

Sorry for the timeline saying I added 19 commits. I just rebased the branch onto main and GitHub got confused. (I also added a few more comments.)

the80srobot avatar Dec 15 '23 09:12 the80srobot

Still looking into internal build stuff with this.

pmarkowsky avatar Jan 09 '24 21:01 pmarkowsky

Apologies for the long delay here. There were some internal things that needed to happen before we could resume looking at this.

pmarkowsky avatar Mar 01 '24 15:03 pmarkowsky

@the80srobot Can you make this a static library? That would simplify the internal builds dramatically.

pmarkowsky avatar Mar 13 '24 16:03 pmarkowsky

Moving back to draft for now. This PR has become a bit stale and will require some effort to get back to being merge-ready.

mlw avatar Jul 02 '24 15:07 mlw