arrow2 icon indicating copy to clipboard operation
arrow2 copied to clipboard

Separate flags for parquet2 compression formats?

Open jenanwise opened this issue 2 years ago • 8 comments

Right now, arrow2 exposes the io_parquet_compression feature flag to opt-in to the compression formats for parquet. However, this enables all of lz4, zstd, snappy, gzip, and brotli, each of which can be fairly heavyweight to build. I imagine most folks will only want to use one of the above — e.g., I am only using zstd.

On my laptop in a bare repository with thisCargo.toml:

[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }

cargo +nightly build --release --timings

Unit Total Codegen
1. arrow2 v0.10.1 15.1s 11.1s (73%)
2. bitpacking v0.8.4 14.9s 7.9s (53%)
3. zstd-sys v1.6.3+zstd.1.5.2 build script (run) 12.5s
4. brotli v3.3.3 10.4s 8.0s (77%)
5. parquet-format-async-temp v0.2.0 8.3s 5.6s (67%)
6. arrow-format v0.4.0 5.8s 4.3s (74%)
7. serde_derive v1.0.136 5.6s
8. parquet2 v0.10.3 4.9s 4.1s (83%)
9. syn v1.0.89 4.1s 1.9s (46%)
10. futures-util v0.3.21 4.0s 0.7s (16%)
11. brotli-decompressor v2.3.2 3.9s 3.3s (85%)
12. serde v1.0.136 3.2s 0.4s (11%)
13. lz4-sys v1.9.3 build script (run) 2.9s  

Switching to this Cargo.toml:

[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }
parquet2 = { version = "0.10", default_features = false, features = ["zstd"] }
Unit Total Codegen
1. arrow2 v0.10.1 13.5s 11.1s (82%)
2. bitpacking v0.8.4 12.5s 5.0s (40%)
3. zstd-sys v1.6.3+zstd.1.5.2 build script (run) 8.4s
4. parquet-format-async-temp v0.2.0 6.4s 3.9s (61%)
5. serde_derive v1.0.136 5.0s
6. futures-util v0.3.21 4.2s 1.0s (25%)
7. arrow-format v0.4.0 3.7s 2.5s (66%)
8. serde v1.0.136 3.2s 0.5s (16%)
9. syn v1.0.89 2.9s 1.3s (46%)
10. chrono v0.4.19 1.8s 1.1s (62%)
11. futures-macro v0.3.21 1.7s
12. parquet2 v0.10.3 1.6s 1.1s (67%)
13. async-trait v0.1.52 1.6s

It seems like there might be two solutions:

  1. advise folks who want finer-grained control to skip the io_parquet_compression flag, and instead directly put a parquet2 dependency in their Cargo.toml, as I'm doing above. This seems rather fragile.
  2. add individual io_parquet_compression_zstd etc flags.

Thoughts?

jenanwise avatar Mar 19 '22 02:03 jenanwise

Hi @jenanwise and thank you for your analysis. I think it is a great idea to split the flag in multiple subflags.

We may keep io_parquet_compression = ["io_parquet_compression_zstd", "io_parquet_compression_snappy", ...] since when reading, users usually want to support all compressions.

jorgecarleitao avatar Mar 19 '22 03:03 jorgecarleitao

@jenanwise , would you like to work on this? It is a good first issue.

jorgecarleitao avatar Mar 22 '22 20:03 jorgecarleitao

@jorgecarleitao Happily!

jenanwise avatar Mar 23 '22 23:03 jenanwise

+1, this is also needed for me to depend on arrow2 and datafusion simultaneously since they both use zstd-sys but they use different versions so it creates a conflict

ozgrakkurt avatar May 12 '22 20:05 ozgrakkurt

  1. advise folks who want finer-grained control to skip the io_parquet_compression flag, and instead directly put a parquet2 dependency in their Cargo.toml, as I'm doing above. This seems rather fragile

Just a note that this works fine for my needs 🤷‍♂️
https://github.com/kylebarron/parquet-wasm/blob/93c498484f997a85a97c049cf4c1cbacce04fab8/Cargo.toml#L24-L29. I turn on each individual parquet2 dependency as needed.

kylebarron avatar May 12 '22 20:05 kylebarron

@kylebarron thanks!

ozgrakkurt avatar May 12 '22 20:05 ozgrakkurt

Apologies. I took a brief look — the implementation is trivial but I didn't get around to creating a test set. However, I'm no longer using arrow2, so the issue is up for grabs.

jenanwise avatar Oct 17 '22 02:10 jenanwise

@jorgecarleitao Btw, it looks like this issue was addressed by #1207

MostlyAmiable avatar Oct 25 '23 21:10 MostlyAmiable