snappy-java

Compression ratio degraded for repeated INT64 columns in parquet

Open liorchaga opened this issue 3 years ago • 1 comments

While testing an upgrade to Spark 3.1.1, I noticed that compression of repeated INT64 columns got worse.

https://stackoverflow.com/questions/67413589/parquet-compression-degradation-when-upgrading-spark/67455721#67455721

Reading this file, saved with snappy 1.1.2.6, and writing it back with a higher version drops the compression ratio from 2.05 to 1.26. Any snappy version higher than 1.1.2.6 reproduces the issue.
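To put the two ratios in terms of output size, here is a rough sketch of the arithmetic. It assumes the compression ratio means uncompressed size divided by compressed size, which the report does not state explicitly; the class and method names are made up for illustration.

```java
public class RatioImpact {
    // Assumption: compression ratio = uncompressed size / compressed size.
    // For the same input, returns the fractional growth in compressed size
    // when the ratio drops from oldRatio to newRatio.
    static double compressedSizeGrowth(double oldRatio, double newRatio) {
        return oldRatio / newRatio - 1.0;
    }

    public static void main(String[] args) {
        // Ratios taken from the report above.
        double growth = compressedSizeGrowth(2.05, 1.26);
        System.out.printf("Compressed output grows by about %.0f%%%n", growth * 100);
    }
}
```

Under that assumption, the same data written with the newer snappy version takes roughly 63% more space than it did with 1.1.2.6.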

liorchaga May 09 '21 08:05

Does anyone know what changed to cause such a measurable change in compression ratio? I just tested 1.1.2.6 on one of my datasets and saw an immediate 20% savings.

gitrc Jun 27 '22 21:06