parquet-go
Encoding question
Hi!
What would you recommend if I don't know whether my data is high cardinality or not for choosing PLAIN or PLAIN_DICTIONARY? Is it possible to make that trade-off on the first flush?
Thanks, Sean
Not a maintainer, but I think PLAIN or PLAIN_RLE should be the safer choice. The downside is a bigger output size.
You can read something here: https://github.com/apache/parquet-format/blob/master/Encodings.md
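To see why cardinality drives this trade-off, here is a rough back-of-the-envelope sketch (illustrative only, not parquet-go's actual size accounting): PLAIN stores every value verbatim, while a dictionary stores each distinct value once plus a small index per row, so the dictionary only wins when values repeat a lot.

```go
package main

import "fmt"

// plainSize approximates PLAIN encoding: every value is written verbatim
// with a 4-byte length prefix, as for a parquet BYTE_ARRAY.
func plainSize(vals []string) int {
	n := 0
	for _, v := range vals {
		n += 4 + len(v)
	}
	return n
}

// dictSize approximates dictionary encoding: each distinct value is stored
// once, plus an index per row (a fixed 4 bytes here for simplicity; real
// writers bit-pack the indexes).
func dictSize(vals []string) int {
	seen := map[string]bool{}
	n := 0
	for _, v := range vals {
		if !seen[v] {
			seen[v] = true
			n += 4 + len(v)
		}
		n += 4
	}
	return n
}

func main() {
	low := make([]string, 1000)
	high := make([]string, 1000)
	for i := range low {
		low[i] = fmt.Sprintf("status-%d", i%3) // only 3 distinct values
		high[i] = fmt.Sprintf("uuid-%08d", i)  // all values distinct
	}
	fmt.Println("low cardinality:  plain", plainSize(low), "dict", dictSize(low))
	fmt.Println("high cardinality: plain", plainSize(high), "dict", dictSize(high))
}
```

With low-cardinality data the dictionary version is a fraction of the PLAIN size; with all-distinct data it is strictly larger, since you pay for the full dictionary plus an index per row.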
Though I have a related question to the maintainer. From that link:
If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.
Do I understand correctly that this auto-fallback isn't implemented in parquet-go? And if it were, then there wouldn't be any sense in using non-dictionary encodings?
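For context, the fallback the spec describes could be sketched like this (a minimal illustration of the idea, not parquet-go code; the threshold constant and function shape are assumptions): try to build a dictionary while encoding, and abandon it for PLAIN once the dictionary grows past a limit.

```go
package main

import "fmt"

// maxDictEntries is a tiny threshold for demonstration; real writers limit
// the dictionary by page size and distinct-value count.
const maxDictEntries = 4

// encodeColumn attempts dictionary encoding and falls back to plain when
// the dictionary exceeds the threshold, mirroring the behaviour the
// parquet-format spec describes.
func encodeColumn(vals []string) (encoding string, dict []string, indexes []int32) {
	pos := map[string]int32{}
	for _, v := range vals {
		i, ok := pos[v]
		if !ok {
			if len(dict) >= maxDictEntries {
				// Too many distinct values: give up on the dictionary.
				return "PLAIN", nil, nil
			}
			i = int32(len(dict))
			pos[v] = i
			dict = append(dict, v)
		}
		indexes = append(indexes, i)
	}
	return "PLAIN_DICTIONARY", dict, indexes
}

func main() {
	enc, dict, idx := encodeColumn([]string{"a", "b", "a", "c"})
	fmt.Println(enc, dict, idx) // dictionary fits: PLAIN_DICTIONARY [a b c] [0 1 0 2]
	enc, _, _ = encodeColumn([]string{"a", "b", "c", "d", "e"})
	fmt.Println(enc) // too many distinct values: PLAIN
}
```

Note that even with such a fallback there can still be reasons to force PLAIN up front (e.g. to avoid the cost of building and then discarding a dictionary), so non-dictionary encodings would not become entirely pointless.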
The auto-fallback is not implemented. I arrived at this question because I was triggering an integer overflow on chunksize when using PLAIN_DICTIONARY with large, high-cardinality BYTE_ARRAY data.
hi, @slydon , @simplylizz. The auto-fallback is not implemented. It actually uses an int32 to store the index.