
feat(rust): Add RLE to `RLE_DICTIONARY` encoder

thalassemia opened this pull request 2 months ago

The RLE_DICTIONARY encoder currently only supports bit-packing. After this PR, the encoder switches between bit-packing and RLE: runs longer than 8 repeated values are RLE-encoded, and everything else is bit-packed (always a multiple of 8 values at a time).
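
To make the switching rule concrete, here is a rough Python sketch of how a sequence of dictionary keys could be split into RLE runs and bit-packed groups. The actual encoder is Rust inside polars-parquet; the helper name and threshold parameter below are made up for illustration, and the real encoder additionally pads bit-packed groups to a multiple of 8 values.

def plan_hybrid_encoding(keys, min_rle_run=8):
    """Split dictionary keys into RLE runs and bit-packed groups.

    Runs longer than `min_rle_run` values become RLE; everything else is
    buffered and emitted as a bit-packed group.
    """
    chunks = []   # ("rle", value, count) or ("bitpack", [values, ...])
    buffer = []   # values waiting to be bit-packed
    i = 0
    while i < len(keys):
        j = i
        while j < len(keys) and keys[j] == keys[i]:
            j += 1                      # extend the run of identical values
        run = j - i
        if run > min_rle_run:
            if buffer:                  # flush pending bit-packed values first
                chunks.append(("bitpack", buffer))
                buffer = []
            chunks.append(("rle", keys[i], run))
        else:
            buffer.extend(keys[i:j])    # short run: cheaper to bit-pack
        i = j
    if buffer:
        chunks.append(("bitpack", buffer))
    return chunks

print(plan_hybrid_encoding([7] * 20 + [1, 2, 3, 1, 2, 3, 1, 2]))
# [('rle', 7, 20), ('bitpack', [1, 2, 3, 1, 2, 3, 1, 2])]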

This addresses the specific case pointed out in #10680. With these changes, I get 2745 bytes for pl_zstd.pq and 2962 bytes for pa_zstd.pq after running the following code:

from random import randint
import polars as pl
rand_value = randint(1, 10**15)
df = pl.DataFrame({
    # one value repeated 10 million times: a single long run, ideal for RLE
    'A': [rand_value for _ in range(10_000_000)],
}, schema={
    'A': pl.UInt64,
})
df.write_parquet('pa_zstd.pq', compression='zstd', use_pyarrow=True)
# Match PyArrow default row group size
df.write_parquet('pl_zstd.pq', compression='zstd', use_pyarrow=False, row_group_size=1024**2)
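
As a quick sanity check (not part of the PR), the resulting file sizes can be compared after running the snippet above:

import os
# expected roughly 2745 bytes (pl_zstd.pq) vs 2962 bytes (pa_zstd.pq), per the numbers above
print('pl_zstd.pq:', os.path.getsize('pl_zstd.pq'), 'bytes')
print('pa_zstd.pq:', os.path.getsize('pa_zstd.pq'), 'bytes')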

At first, the additional logic slowed the encoder down significantly. To address this, I profiled the code and made two optimizations:

  • The encoder begins by casting the input array to a dictionary array. Currently, the dictionary is created with u32 keys and downcast to u16 if possible. I removed this downcasting step because it seemed unnecessary (f7e803a).
  • When populating the dictionary array, I noticed that a lot of time is spent in memcpy because the key vector is grown one value at a time. I resolved this by reserving the right vector size from the start (14fe89c); the sketch below illustrates the effect.
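
A small Python/NumPy analogy for the second point (not the PR's Rust code, which reserves the key vector's capacity up front): growing a buffer one element at a time forces repeated reallocations and copies, while reserving the final size once fills memory in place.

import time
import numpy as np

n = 50_000

start = time.perf_counter()
grown = np.empty(0, dtype=np.uint32)
for i in range(n):
    grown = np.append(grown, i)          # reallocates and copies the whole buffer every time
print(f'grow one value at a time: {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
reserved = np.empty(n, dtype=np.uint32)  # reserve the full size once
for i in range(n):
    reserved[i] = i
print(f'reserve up front:        {time.perf_counter() - start:.2f}s')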

After these optimizations, the RLE_DICTIONARY encoder performs on par with, or only slightly worse than, use_pyarrow=True in most cases (and faster than the current encoder). However, performance can be up to 50% worse than use_pyarrow=True (matching the current encoder) on high-cardinality data that frequently switches between RLE and bit-packing:

import itertools
import polars as pl
# Performance degrades with frequent switches between RLE and bit-packing
t = pl.DataFrame({'a': itertools.chain.from_iterable([[i] * 9 + [i+1] * 8 for i in range(100000)])})
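
A hypothetical timing harness to reproduce that comparison (the filename and loop are mine, not from the PR), rebuilding the same frame as above:

import itertools
import time
import polars as pl

t = pl.DataFrame({'a': list(itertools.chain.from_iterable(
    [[i] * 9 + [i + 1] * 8 for i in range(100000)]))})

for use_pyarrow in (False, True):
    start = time.perf_counter()
    t.write_parquet('switching.pq', compression='zstd', use_pyarrow=use_pyarrow)
    print(f'use_pyarrow={use_pyarrow}: {time.perf_counter() - start:.2f}s')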

Some other changes:

  • Allow nested types to use RLE_DICTIONARY encoding (9c73a61). For some reason, dictionary arrays do not scale well for large nested columns. I cannot figure out what is causing this. For example:
import polars as pl
nested_tbl = pl.DataFrame({'a': [[0] * 8 + [2] * 9 + [1] * 8 + [3] * 9] * 5000000})
int_tbl = pl.DataFrame({'a': ([0] * 8 + [2] * 9 + [1] * 8 + [3] * 9) * 5000000})
# 12 seconds, most of which is spent casting to dictionary array
nested_tbl.write_parquet('tmp.pq')
# 3 seconds
int_tbl.write_parquet('tmp.pq')
  • Currently, we use V2 data pages for Parquet. This comes with unclear advantages and one significant disadvantage. Unlike V1 data pages, V2 data pages do not compress definition and repetition levels. This leads to larger file sizes when columns have nested types. Since PyArrow currently defaults to V1 data pages, I believe Polars can safely do the same and reap the file size benefits (d38e1f4).
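
For context (not part of this PR), PyArrow exposes the data page version directly and defaults to V1 pages; the filenames here are just examples:

import polars as pl
import pyarrow.parquet as pq

# nested column, so definition/repetition levels matter for file size
df = pl.DataFrame({'a': [[1, 2], [3]] * 1000})
tbl = df.to_arrow()

pq.write_table(tbl, 'v1_pages.pq', compression='zstd')                           # default data_page_version='1.0'
pq.write_table(tbl, 'v2_pages.pq', compression='zstd', data_page_version='2.0')  # opt in to V2 pages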

This does not fully resolve the linked issue because it still gives users no way to manually specify encodings the way PyArrow does.
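
For reference, this is roughly how recent PyArrow versions let users pick per-column encodings (forcing an encoding requires disabling dictionary encoding for that column); Polars has no equivalent option yet:

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({'a': list(range(1_000_000))})

pq.write_table(
    df.to_arrow(),
    'delta.pq',
    use_dictionary=False,                          # required when overriding encodings
    column_encoding={'a': 'DELTA_BINARY_PACKED'},  # per-column encoding choice
)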

thalassemia · Apr 29 '24 22:04

Codecov Report

Attention: Patch coverage is 97.74011% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.93%. Comparing base (f0dbb6a) to head (b50a82a). Report is 37 commits behind head on main.

Note: the current head b50a82a differs from the pull request's most recent head 4f06b9e. Consider uploading reports for commit 4f06b9e to get more accurate results.

Files                                                   Patch %   Lines
crates/polars-arrow/src/compute/cast/binary_to.rs        0.00%    1 Missing
crates/polars-arrow/src/compute/cast/binview_to.rs      50.00%    1 Missing
crates/polars-arrow/src/compute/cast/utf8_to.rs          0.00%    1 Missing
...parquet/src/parquet/encoding/hybrid_rle/encoder.rs   99.37%    1 Missing
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15959      +/-   ##
==========================================
- Coverage   81.29%   80.93%   -0.36%     
==========================================
  Files        1381     1385       +4     
  Lines      176876   178291    +1415     
  Branches     3043     3050       +7     
==========================================
+ Hits       143789   144307     +518     
- Misses      32604    33493     +889     
- Partials      483      491       +8     
Flag     Coverage Δ
python   74.47% <96.04%> (-0.28%)
rust     78.15% <97.74%> (-0.26%)

Flags with carried forward coverage won't be shown.


codecov[bot] · Apr 29 '24 22:04

Currently, we use V2 data pages for Parquet. This comes with unclear advantages and one significant disadvantage. Unlike V1 data pages, V2 data pages do not compress definition and repetition levels. This leads to larger file sizes when columns have nested types. Since PyArrow currently defaults to V1 data pages, I believe Polars can safely do the same and reap the file size benefits (https://github.com/pola-rs/polars/commit/d38e1f43d2bc419d83ace47088935b906e5e0948).

Hmm.. Do we actually compress those v1 pages then? 🤔

Allow nested types to use RLE_DICTIONARY encoding (https://github.com/pola-rs/polars/commit/9c73a61a395d71183917d1db75cf2b08e4c2434f). For some reason, dictionary arrays do not scale well for large nested columns. I cannot figure out what is causing this. For example:

I don't think we should activate it for nested types then.

ritchie46 · May 01 '24 13:05

Hmm.. Do we actually compress those v1 pages then? 🤔

Yup, here's the relevant code. https://github.com/pola-rs/polars/blob/864e7504fc8b38e82688b43f63681b6cda77fd61/crates/polars-parquet/src/parquet/write/compression.rs#L23-L38

I don't think we should activate it for nested types then.

That makes sense. I'd like to revisit this later, but I think I'll need to learn some more Rust before tackling that problem.

Thank you for the review!

thalassemia · May 03 '24 03:05

That makes sense. I'd like to revisit this later, but I think I'll need to learn some more Rust before tackling that problem.

Ok, I think there is value in this, so I will pick up those points and get this in.

ritchie46 · May 03 '24 06:05

Thank you for the PR @thalassemia

ritchie46 · May 03 '24 14:05

We had to revert this because of https://github.com/pola-rs/polars/issues/16109

I do think these changes were interesting, so if you or anyone else has time to find the cause, that'd be appreciated.

ritchie46 · May 08 '24 10:05

@ritchie46 So sorry for this and thank you @nameexhaustion for fixing it! I should've written a test with more random, realistic data. Since this feature still sounds worthwhile, I'll look into this and include more robust tests for my next PR.

thalassemia · May 08 '24 16:05