parquet-java GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary

Rationale for this change

The current default for V1 pages is PLAIN encoding, which interleaves each string's length with its data. This is inefficient for skipping N values, as the encoding does not allow random access. It is also slow to decode: interleaving lengths with data prevents efficient batched implementations and forces most implementations to copy the data into the usual representation of separate offsets and data for strings.

DELTA_LENGTH_BYTE_ARRAY has none of the above problems, as it stores the lengths separately from the data. The parquet-format spec also seems to recommend this: https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
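To make the layout difference concrete, here is a minimal sketch, not parquet-java code. The class `BinaryLayoutSketch` and all of its methods are hypothetical names made up for illustration. It assumes PLAIN stores each value as a 4-byte little-endian length followed by the bytes, and it simplifies the second layout by keeping the lengths as a plain offsets array, whereas the real DELTA_LENGTH_BYTE_ARRAY encoding writes the lengths with DELTA_BINARY_PACKED before the concatenated data.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in parquet-java.
public class BinaryLayoutSketch {

    // PLAIN-style layout: [len0][bytes0][len1][bytes1]...
    static byte[] encodePlain(String[] values) {
        byte[][] raw = new byte[values.length][];
        int size = 0;
        for (int i = 0; i < values.length; i++) {
            raw[i] = values[i].getBytes(StandardCharsets.UTF_8);
            size += 4 + raw[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        for (byte[] r : raw) {
            buf.putInt(r.length);
            buf.put(r);
        }
        return buf.array();
    }

    // Reaching value n requires walking over the n preceding length prefixes.
    static String readPlain(byte[] page, int n) {
        ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < n; i++) {
            int skipped = buf.getInt();
            buf.position(buf.position() + skipped);
        }
        int len = buf.getInt();
        return new String(page, buf.position(), len, StandardCharsets.UTF_8);
    }

    // Separated layout, part 1: offsets derived from the lengths (prefix sums).
    // Simplification: the real encoding stores delta-packed lengths instead.
    static int[] offsets(String[] values) {
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            offsets[i + 1] = offsets[i] + values[i].getBytes(StandardCharsets.UTF_8).length;
        }
        return offsets;
    }

    // Separated layout, part 2: all value bytes concatenated back to back.
    static byte[] concatData(String[] values) {
        int[] offsets = offsets(values);
        byte[] data = new byte[offsets[values.length]];
        for (int i = 0; i < values.length; i++) {
            byte[] r = values[i].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(r, 0, data, offsets[i], r.length);
        }
        return data;
    }

    // Once offsets are known, value n is a direct slice with no per-value walk and no re-packing.
    static String readSeparated(int[] offsets, byte[] data, int n) {
        return new String(data, offsets[n], offsets[n + 1] - offsets[n], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] values = {"Hello", "World", "Foobar", "ABCDEF"};
        byte[] plainPage = encodePlain(values);
        System.out.println(readPlain(plainPage, 2));
        System.out.println(readSeparated(offsets(values), concatData(values), 2));
    }
}
```

In the separated layout, the offsets and data buffers already match the offsets-plus-data string representation mentioned above, which is why a reader can hand them over without copying each value.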

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Nov 28 '24 13:11 raunaqmorarka

Hey @raunaqmorarka thanks for raising this. I think we want to discuss on the devlist first if we want to change behavior. Would you be interested to raise this?

Nov 28 '24 14:11 Fokko

> Hey @raunaqmorarka thanks for raising this. I think we want to discuss on the devlist first if we want to change behavior. Would you be interested to raise this?

I'm not sure how to start a discussion on the devlist; I don't have credentials to log in there. It would be nice to discuss this on the GH issue https://github.com/apache/parquet-java/issues/3083 if that's possible.

Nov 28 '24 14:11 raunaqmorarka

@raunaqmorarka You can send an email to [email protected] to subscribe. If you don't want to subscribe, you may directly send an email to [email protected]. You can see https://lists.apache.org/[email protected] for reference.

Nov 28 '24 15:11 wgtmac

In my opinion the default should not be changed, but it would be really useful to allow users to configure the encoding per column, similar to how BYTE_STREAM_SPLIT is now handled in ParquetProperties.

There might still be Parquet readers with incomplete support for delta encodings, so changing the default could turn a simple version update into problems that are only noticed much later.

Jan 17 '25 13:01 jhorstmann