parquet-go
Encoding question
Hi!
What would you recommend if I don't know whether my data is high cardinality or not for choosing PLAIN or PLAIN_DICTIONARY? Is it possible to make that trade-off on the first flush?
Thanks, Sean
Not a maintainer, but I think PLAIN or PLAIN_RLE should be the safer choice. The downside is a bigger output size.
You can read something here: https://github.com/apache/parquet-format/blob/master/Encodings.md
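To see why cardinality drives this trade-off, here is a rough back-of-the-envelope sketch (illustrative only, not parquet-go's actual size accounting): PLAIN stores every value verbatim, while a dictionary stores each distinct value once plus a small index per row, so the dictionary only wins when values repeat a lot.

```go
package main

import "fmt"

// plainSize approximates PLAIN encoding: every value is written verbatim
// with a 4-byte length prefix, as for a parquet BYTE_ARRAY.
func plainSize(vals []string) int {
	n := 0
	for _, v := range vals {
		n += 4 + len(v)
	}
	return n
}

// dictSize approximates dictionary encoding: each distinct value is stored
// once, plus an index per row (a fixed 4 bytes here for simplicity; real
// writers bit-pack the indexes).
func dictSize(vals []string) int {
	seen := map[string]bool{}
	n := 0
	for _, v := range vals {
		if !seen[v] {
			seen[v] = true
			n += 4 + len(v)
		}
		n += 4
	}
	return n
}

func main() {
	low := make([]string, 1000)
	high := make([]string, 1000)
	for i := range low {
		low[i] = fmt.Sprintf("status-%d", i%3) // only 3 distinct values
		high[i] = fmt.Sprintf("uuid-%08d", i)  // all values distinct
	}
	fmt.Println("low cardinality:  plain", plainSize(low), "dict", dictSize(low))
	fmt.Println("high cardinality: plain", plainSize(high), "dict", dictSize(high))
}
```

With low-cardinality data the dictionary version is a fraction of the PLAIN size; with all-distinct data it is strictly larger, since you pay for the full dictionary plus an index per row.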
Though I have a related question to the maintainer. From that link:
If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.
Do I understand correctly that this auto-fallback isn't implemented in parquet-go? And if it were, then there wouldn't be any sense in using non-dictionary encodings?
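For context, the fallback the spec describes could be sketched like this (a minimal illustration of the idea, not parquet-go code; the threshold constant and function shape are assumptions): try to build a dictionary while encoding, and abandon it for PLAIN once the dictionary grows past a limit.

```go
package main

import "fmt"

// maxDictEntries is a tiny threshold for demonstration; real writers limit
// the dictionary by page size and distinct-value count.
const maxDictEntries = 4

// encodeColumn attempts dictionary encoding and falls back to plain when
// the dictionary exceeds the threshold, mirroring the behaviour the
// parquet-format spec describes.
func encodeColumn(vals []string) (encoding string, dict []string, indexes []int32) {
	pos := map[string]int32{}
	for _, v := range vals {
		i, ok := pos[v]
		if !ok {
			if len(dict) >= maxDictEntries {
				// Too many distinct values: give up on the dictionary.
				return "PLAIN", nil, nil
			}
			i = int32(len(dict))
			pos[v] = i
			dict = append(dict, v)
		}
		indexes = append(indexes, i)
	}
	return "PLAIN_DICTIONARY", dict, indexes
}

func main() {
	enc, dict, idx := encodeColumn([]string{"a", "b", "a", "c"})
	fmt.Println(enc, dict, idx) // dictionary fits: PLAIN_DICTIONARY [a b c] [0 1 0 2]
	enc, _, _ = encodeColumn([]string{"a", "b", "c", "d", "e"})
	fmt.Println(enc) // too many distinct values: PLAIN
}
```

Note that even with such a fallback there can still be reasons to force PLAIN up front (e.g. to avoid the cost of building and then discarding a dictionary), so non-dictionary encodings would not become entirely pointless.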
The auto-fallback is not implemented. I arrived at this question because I was triggering an integer overflow on chunksize when using PLAIN_DICTIONARY with large, high-cardinality BYTE_ARRAY data.
hi, @slydon , @simplylizz. The auto-fallback is not implemented. It actually uses an int32 to store the index.