Support RLE encoding in data pages for bools

Open EamonHetherton opened this issue 7 months ago • 0 comments

Issue description

My reading of the Parquet format spec is that Run Length Encoding / Bit-Packing Hybrid (RLE = 3) encoding is allowed for bool values in data pages. With heavily repeated values, RLE can be much more efficient than plain bit packing, even for bools. I believe it is valid in both v1 and v2 files.
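To illustrate the size difference, here is a rough Python sketch (not parquet-dotnet code). It assumes the hybrid layout as I understand it from the format spec: an RLE run is a ULEB128 varint header of `run_length << 1`, followed by the repeated value in one byte for bit width 1. The function names are made up for this example. The encoder emits RLE runs only, never bit-packed runs, and it leaves out the length prefix that I believe precedes the encoded bools in a data page.

```python
from itertools import groupby


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a ULEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def plain_encode_bools(values) -> bytes:
    """PLAIN-style bit packing: one bit per value, LSB first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def rle_encode_bools(values) -> bytes:
    """Hypothetical encoder that emits only RLE runs (bit width 1).

    A real encoder would switch to bit-packed runs for short runs;
    this sketch does not, so it is only a win on repetitive data.
    """
    out = bytearray()
    for value, run in groupby(values):
        run_len = sum(1 for _ in run)
        out += encode_varint(run_len << 1)  # low bit 0 marks an RLE run
        out.append(1 if value else 0)       # repeated value, one byte
    return bytes(out)


def rle_decode_bools(data: bytes) -> list:
    """Decode the output of rle_encode_bools (RLE runs only)."""
    values = []
    pos = 0
    while pos < len(data):
        header = 0
        shift = 0
        while True:
            byte = data[pos]
            pos += 1
            header |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if header & 1:
            raise ValueError("bit-packed runs are not handled in this sketch")
        values.extend([bool(data[pos])] * (header >> 1))
        pos += 1
    return values


if __name__ == "__main__":
    column = [True] * 10_000 + [False] * 10_000
    print("plain bytes:", len(plain_encode_bools(column)))
    print("rle bytes:  ", len(rle_encode_bools(column)))
```

For a column of 10,000 `True` values followed by 10,000 `False` values, bit packing needs 2,500 bytes, while the two RLE runs fit in a handful of bytes. On data that alternates frequently, this RLE-only sketch would come out larger than bit packing, which is why the hybrid encoding also has bit-packed runs.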

As an alternative, dictionary encoding for bools would give comparable size savings on heavily repeated values, with slightly more overhead, because parquet-dotnet already supports RLE/BitPacking for dictionary indexes. However, although the format spec does not appear to prohibit it, "parquet-mr" does not allow dictionary encoding for bools (I suspect because it supports RLE/BitPacking encoding, which will always be more efficient than a dictionary of bools anyway). For compatibility reasons, it's probably better to just support RLE/BitPacking for bool values and use that instead.

EamonHetherton, Jul 16 '24 01:07