arrow2
arrow2 copied to clipboard
Separate flags for parquet2 compression formats?
Right now, arrow2
exposes the io_parquet_compression
feature flag to opt-in to the compression formats for parquet. However, this enables all of lz4
, zstd
, snappy
, gzip
, and brotli
, each of which can be fairly heavyweight to build. I imagine most folks will only want to use one of the above — e.g., I am only using zstd
.
On my laptop in a bare repository with thisCargo.toml
:
[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }
cargo +nightly build --release --timings
Unit | Total | Codegen | |
---|---|---|---|
1. | arrow2 v0.10.1 | 15.1s | 11.1s (73%) |
2. | bitpacking v0.8.4 | 14.9s | 7.9s (53%) |
3. | zstd-sys v1.6.3+zstd.1.5.2 build script (run) | 12.5s | |
4. | brotli v3.3.3 | 10.4s | 8.0s (77%) |
5. | parquet-format-async-temp v0.2.0 | 8.3s | 5.6s (67%) |
6. | arrow-format v0.4.0 | 5.8s | 4.3s (74%) |
7. | serde_derive v1.0.136 | 5.6s | |
8. | parquet2 v0.10.3 | 4.9s | 4.1s (83%) |
9. | syn v1.0.89 | 4.1s | 1.9s (46%) |
10. | futures-util v0.3.21 | 4.0s | 0.7s (16%) |
11. | brotli-decompressor v2.3.2 | 3.9s | 3.3s (85%) |
12. | serde v1.0.136 | 3.2s | 0.4s (11%) |
13. | lz4-sys v1.9.3 build script (run) | 2.9s |
Switching to this Cargo.toml
:
[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }
parquet2 = { version = "0.10", default_features = false, features = ["zstd"] }
Unit | Total | Codegen | |
---|---|---|---|
1. | arrow2 v0.10.1 | 13.5s | 11.1s (82%) |
2. | bitpacking v0.8.4 | 12.5s | 5.0s (40%) |
3. | zstd-sys v1.6.3+zstd.1.5.2 build script (run) | 8.4s | |
4. | parquet-format-async-temp v0.2.0 | 6.4s | 3.9s (61%) |
5. | serde_derive v1.0.136 | 5.0s | |
6. | futures-util v0.3.21 | 4.2s | 1.0s (25%) |
7. | arrow-format v0.4.0 | 3.7s | 2.5s (66%) |
8. | serde v1.0.136 | 3.2s | 0.5s (16%) |
9. | syn v1.0.89 | 2.9s | 1.3s (46%) |
10. | chrono v0.4.19 | 1.8s | 1.1s (62%) |
11. | futures-macro v0.3.21 | 1.7s | |
12. | parquet2 v0.10.3 | 1.6s | 1.1s (67%) |
13. | async-trait v0.1.52 | 1.6s |
It seems like there might be two solutions:
- advise folks who want finer-grained control to skip the
io_parquet_compression
flag, and instead directly put aparquet2
dependency in theirCargo.toml
, as I'm doing above. This seems rather fragile. - add individual
io_parquet_compression_zstd
etc flags.
Thoughts?
Hi @jenanwise and thank you for your analysis. I think it is a great idea to split the flag in multiple subflags.
We may keep io_parquet_compression = ["io_parquet_compression_zstd", "io_parquet_compression_snappy", ...]
since when reading, users usually want to support all compressions.
@jenanwise , would you like to work on this? It is a good first issue.
@jorgecarleitao Happily!
+1, this is also needed for me to depend on arrow2 and datafusion simultaneously since they both use zstd-sys but they use different versions so it creates a conflict
- advise folks who want finer-grained control to skip the
io_parquet_compression
flag, and instead directly put aparquet2
dependency in theirCargo.toml
, as I'm doing above. This seems rather fragile
Just a note that this works fine for my needs 🤷♂️
https://github.com/kylebarron/parquet-wasm/blob/93c498484f997a85a97c049cf4c1cbacce04fab8/Cargo.toml#L24-L29. I turn on each individual parquet2 dependency as needed.
@kylebarron thanks!
Apologies. I took a brief look — the implementation is trivial but I didn't get around to creating a test set. However, I'm no longer using arrow2
, so the issue is up for grabs.
@jorgecarleitao Btw, it looks like this issue was addressed by #1207