Make compression adaptive with V2 data pages
Describe the enhancement requested
When writing a V2 data page, compression appears to be applied unconditionally, even when it doesn't actually yield any benefit: https://github.com/apache/parquet-java/blob/0fea3e1e22fffb0a25193e3efb9a5d090899458a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L305-L311
It would be relatively easy to apply a hardcoded threshold (for example 98% of the uncompressed size) above which the page is written uncompressed, which would make reading faster.
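A minimal self-contained sketch of the idea, using `java.util.zip.Deflater` as a stand-in for the column chunk's codec (the threshold constant and `maybeCompress` helper are hypothetical names, not parquet-java API). Since the V2 data page header carries an `is_compressed` flag, the reader can tell which choice the writer made:

```java
import java.util.Arrays;
import java.util.zip.Deflater;

public class AdaptiveCompression {
    // Hypothetical constant: keep the compressed page only if it is smaller
    // than 98% of the uncompressed size.
    static final double COMPRESSION_RATIO_THRESHOLD = 0.98;

    /**
     * Compresses a page with DEFLATE and returns the original bytes when the
     * space savings are negligible. In a real V2 page, the header's
     * is_compressed flag would record which variant was written.
     */
    static byte[] maybeCompress(byte[] page) {
        Deflater deflater = new Deflater();
        deflater.setInput(page);
        deflater.finish();
        // DEFLATE can expand incompressible input slightly, so leave headroom.
        byte[] buf = new byte[page.length + 64];
        int compressedLen = deflater.deflate(buf);
        deflater.end();
        if (compressedLen >= COMPRESSION_RATIO_THRESHOLD * page.length) {
            return page; // negligible gain: store uncompressed, read faster
        }
        return Arrays.copyOf(buf, compressedLen);
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[4096];              // highly compressible
        byte[] random = new byte[4096];
        new java.util.Random(42).nextBytes(random); // effectively incompressible
        System.out.println(maybeCompress(zeros).length);  // far below 4096
        System.out.println(maybeCompress(random).length); // exactly 4096: left as-is
    }
}
```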
Component(s)
Core
In case it helps for reference, @mapleFU made a similar change in the arrow-rs / Rust parquet reader a few weeks ago:
- https://github.com/apache/arrow-rs/pull/8257
Yes, and I agree that adding a tunable parameter would be even better. In Arrow C++ IPC we have: https://github.com/apache/arrow/blob/03896451c69658105e857ae7103e5081bbaa9bd6/cpp/src/arrow/ipc/options.h#L72-L85
It could have a conservative default such as 0.02 (i.e. require at least 2% space savings before keeping the compressed page).
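That option expresses the cutoff as minimum space savings rather than a size ratio; a 0.02 savings floor is equivalent to the 98% threshold suggested in the issue description. A sketch of that formulation (class, method, and constant names here are hypothetical, not an actual parquet-java API):

```java
public class MinSpaceSavings {
    // Hypothetical tunable mirroring Arrow IPC's minimum-space-savings option:
    // a page is stored compressed only if it shrinks by at least this fraction.
    static final double DEFAULT_MIN_SPACE_SAVINGS = 0.02;

    /** Decides whether the compressed form of a page is worth keeping. */
    static boolean shouldStoreCompressed(long rawSize, long compressedSize,
                                         double minSavings) {
        if (rawSize == 0) {
            return false; // nothing to save on an empty page
        }
        double savings = 1.0 - (double) compressedSize / rawSize;
        return savings >= minSavings;
    }

    public static void main(String[] args) {
        // 3% savings clears the 2% floor; 0.5% savings does not.
        System.out.println(shouldStoreCompressed(1000, 970, DEFAULT_MIN_SPACE_SAVINGS)); // true
        System.out.println(shouldStoreCompressed(1000, 995, DEFAULT_MIN_SPACE_SAVINGS)); // false
    }
}
```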
A config is fine with me. It would also make sense in https://github.com/apache/arrow/blob/03896451c69658105e857ae7103e5081bbaa9bd6/cpp/src/parquet/column_writer.cc#L1051