GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary
Rationale for this change
The current default for V1 pages is PLAIN encoding, which interleaves each string's length with its data. This is inefficient for skipping N values, because the encoding does not allow random access: a reader has to walk every length prefix to find where the next value starts. It is also slow to decode, since interleaving lengths with data prevents efficient batched implementations and forces most implementations to copy the data into the usual representation of separate offsets and data for strings.
DELTA_LENGTH_BYTE_ARRAY has none of the above problems, as it stores all the lengths up front, separately from the concatenated data. The parquet-format spec also seems to recommend this: https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
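To illustrate the difference, here is a minimal Python sketch of the two layouts. It is not parquet-java code, and all function names are hypothetical. The separated layout is simplified: the real DELTA_LENGTH_BYTE_ARRAY encoding stores the lengths with DELTA_BINARY_PACKED, whereas this sketch keeps them as a plain list to show only the effect of separating lengths from data.

```python
import struct
from itertools import accumulate


def encode_plain(values):
    # PLAIN BYTE_ARRAY: every value is a 4-byte little-endian length
    # immediately followed by that many bytes of data.
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def skip_plain(buf, n):
    # Skipping n values means reading n length prefixes one by one,
    # because each prefix sits in front of its own data.
    pos = 0
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + length
    return pos


def read_plain(buf, pos):
    # Read the single value whose length prefix starts at pos.
    (length,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + length]


def encode_separated(values):
    # Simplified stand-in for DELTA_LENGTH_BYTE_ARRAY: all lengths first,
    # then all data concatenated. (Assumption: the delta/bit-packing of
    # the lengths is omitted here.)
    return [len(v) for v in values], b"".join(values)


def to_offsets(lengths):
    # A prefix sum over the lengths yields the offsets array used by
    # typical in-memory string columns; the data buffer needs no copy.
    return [0] + list(accumulate(lengths))


values = [b"apple", b"", b"kiwi", b"banana"]

plain = encode_plain(values)
third_plain = read_plain(plain, skip_plain(plain, 2))

lengths, data = encode_separated(values)
offsets = to_offsets(lengths)
third_separated = data[offsets[2] : offsets[3]]
```

In the separated layout, skipping or slicing only touches the lengths, and the concatenated data can be used as-is alongside the computed offsets.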
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Hey @raunaqmorarka thanks for raising this. I think we want to discuss on the devlist first if we want to change behavior. Would you be interested to raise this?
I'm not sure how to start a discussion on the devlist, as I don't have credentials to log in there. It would be nice to discuss this on the GH issue https://github.com/apache/parquet-java/issues/3083 if that's possible.
@raunaqmorarka You can send an email to [email protected] to subscribe. If you don't want to subscribe, you may directly send an email to [email protected]. You can see https://lists.apache.org/[email protected] for reference.
In my opinion the default should not be changed, but it would be really useful to allow users to configure the encoding per column, similar to how BYTE_STREAM_SPLIT is now handled in ParquetProperties.
There might still be Parquet readers with incomplete support for delta encodings. If the default changed, a simple version update could lead to problems that are only noticed much later.