parquet-java

Make compression adaptive with V2 data pages

Open pitrou opened this issue 2 months ago • 3 comments

Describe the enhancement requested

When writing a V2 data page, compression seems to be applied unconditionally, even when it doesn't yield any benefit: https://github.com/apache/parquet-java/blob/0fea3e1e22fffb0a25193e3efb9a5d090899458a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L305-L311

It would be relatively easy to use a hardcoded threshold (for example, a compressed size above 98% of the uncompressed size) beyond which the page is written uncompressed, which makes reading faster.
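To make the idea concrete, here is a rough sketch of the decision, not the actual parquet-java code. The class `AdaptivePageCompression`, its methods, and the use of `java.util.zip.Deflater` as a stand-in codec are all assumptions for illustration. The `compressed` flag in the result stands for what would be recorded in the V2 page header's `is_compressed` field.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical helper, not part of parquet-java.
public class AdaptivePageCompression {
    // Assumed hardcoded threshold: keep the compressed bytes only if they are
    // at most 98% of the uncompressed size.
    static final double MAX_COMPRESSED_RATIO = 0.98;

    /** The bytes to write for a page, and whether they are compressed. */
    static final class PageBytes {
        final byte[] bytes;
        final boolean compressed;

        PageBytes(byte[] bytes, boolean compressed) {
            this.bytes = bytes;
            this.compressed = compressed;
        }
    }

    /** Returns true if the compressed size is small enough to be worth keeping. */
    static boolean worthCompressing(long uncompressedSize, long compressedSize) {
        if (uncompressedSize <= 0) {
            return false; // nothing to gain on an empty page
        }
        return compressedSize <= uncompressedSize * MAX_COMPRESSED_RATIO;
    }

    /** Stand-in codec: zlib/deflate from the JDK. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Compresses the page, then falls back to the raw bytes if it did not help. */
    static PageBytes choose(byte[] uncompressed) {
        byte[] compressed = deflate(uncompressed);
        return worthCompressing(uncompressed.length, compressed.length)
                ? new PageBytes(compressed, true)
                : new PageBytes(uncompressed, false);
    }
}
```

With this sketch, a page of repeated values would stay compressed, while a page of high-entropy bytes would be written as-is, so a reader would not need to decompress it.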

Component(s)

Core

pitrou avatar Oct 08 '25 12:10 pitrou

In case it helps for reference, @mapleFU made a similar change in the arrow-rs / Rust parquet reader a few weeks ago:

  • https://github.com/apache/arrow-rs/pull/8257

alamb avatar Oct 22 '25 10:10 alamb

Yes, and I agree that adding a tunable parameter would be even better. In Arrow C++ IPC we have: https://github.com/apache/arrow/blob/03896451c69658105e857ae7103e5081bbaa9bd6/cpp/src/arrow/ipc/options.h#L72-L85

It could have a conservative default such as 0.02, i.e. only keep the compressed page if it saves at least 2% of the uncompressed size.
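As a sketch of what a tunable parameter could look like on the Java side, assuming it is expressed as a minimum space savings (1 - compressed size / uncompressed size), which is how I read the 0.02 figure. The class name `CompressionPolicy` and its methods are hypothetical, not an existing parquet-java API.

```java
// Hypothetical policy object, not part of parquet-java.
public class CompressionPolicy {
    // Minimum fraction of the uncompressed size that compression must save
    // for the compressed bytes to be kept. Must lie in [0, 1].
    private final double minSpaceSavings;

    public CompressionPolicy(double minSpaceSavings) {
        // The negated form also rejects NaN.
        if (!(minSpaceSavings >= 0.0 && minSpaceSavings <= 1.0)) {
            throw new IllegalArgumentException(
                    "minSpaceSavings must be in [0, 1], got " + minSpaceSavings);
        }
        this.minSpaceSavings = minSpaceSavings;
    }

    /** Space savings as a fraction: 1 - compressed / uncompressed. */
    public static double spaceSavings(long uncompressedSize, long compressedSize) {
        return 1.0 - (double) compressedSize / uncompressedSize;
    }

    /** Returns true if the compressed bytes should be kept for this page. */
    public boolean keepCompressed(long uncompressedSize, long compressedSize) {
        if (uncompressedSize <= 0) {
            return false; // empty page: write it uncompressed
        }
        return spaceSavings(uncompressedSize, compressedSize) >= minSpaceSavings;
    }
}
```

Under this reading, `new CompressionPolicy(0.02)` corresponds to the 98% threshold from the issue description, and `new CompressionPolicy(0.0)` keeps any compressed page that is not larger than the original.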

pitrou avatar Oct 22 '25 10:10 pitrou

A config option is fine with me. It would also make sense in the Arrow C++ Parquet column writer: https://github.com/apache/arrow/blob/03896451c69658105e857ae7103e5081bbaa9bd6/cpp/src/parquet/column_writer.cc#L1051

mapleFU avatar Oct 22 '25 12:10 mapleFU