snappy-java
Compression ratio degraded for repeated INT64 columns in parquet
While testing an upgrade to Spark 3.1.1, I noticed that the compression of repeated INT64 columns got worse.
https://stackoverflow.com/questions/67413589/parquet-compression-degradation-when-upgrading-spark/67455721#67455721
Reading this file, which was saved with snappy 1.1.2.6, and writing it back with a higher version drops the compression ratio from 2.05 to 1.26. Every snappy version higher than 1.1.2.6 that I tried reproduced the issue.
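To put those two ratios in terms of file size, here is a small sketch of the arithmetic. It assumes the ratio is defined as uncompressed size divided by compressed size, which the report does not state explicitly; `size_growth` is a hypothetical helper name used only for illustration.

```python
def size_growth(old_ratio: float, new_ratio: float) -> float:
    """Factor by which the compressed size grows for the same uncompressed data.

    Assumes ratio = uncompressed_size / compressed_size, so
    compressed_size = uncompressed_size / ratio.
    """
    if old_ratio <= 0 or new_ratio <= 0:
        raise ValueError("compression ratios must be positive")
    return old_ratio / new_ratio


# Ratios reported above: 2.05 with snappy 1.1.2.6, 1.26 with later versions.
growth = size_growth(2.05, 1.26)
saving = 1 - 1 / growth

print(f"compressed output is about {growth:.2f}x larger")          # about 1.63x
print(f"reverting would shrink the larger file by about {saving:.0%}")  # about 39%
```

Under that assumption, the same data takes roughly 1.63 times the space after the upgrade.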
Does anyone know what changed to cause such a measurable change in compression ratio? I just tested 1.1.2.6 on one of my datasets and saw an immediate 20% savings.