feat(rust): Add RLE to `RLE_DICTIONARY` encoder
The `RLE_DICTIONARY` encoder currently only supports bit-packing. After this PR, the encoder will switch between bit-packing and RLE depending on whether a run is longer than 8 values (it always bit-packs a multiple of 8 values at a time).
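For illustration, here is a rough Python sketch of that kind of switching rule. This is not the actual Rust implementation; the threshold handling and the padding of bit-packed groups to a multiple of 8 values are simplified.

```python
def plan_runs(keys, min_rle_run=8):
    """Split a sequence of dictionary keys into RLE and bit-packed runs.

    Hypothetical illustration only: runs of identical keys at least
    `min_rle_run` long become RLE runs; everything else is buffered and
    would be bit-packed in groups of 8 values (padding not shown).
    """
    runs = []      # list of ('rle', (value, length)) or ('bitpack', values)
    buffer = []    # values waiting to be bit-packed
    i = 0
    while i < len(keys):
        j = i
        while j < len(keys) and keys[j] == keys[i]:
            j += 1
        run_len = j - i
        if run_len >= min_rle_run:
            if buffer:
                runs.append(('bitpack', buffer))
                buffer = []
            runs.append(('rle', (keys[i], run_len)))
        else:
            buffer.extend(keys[i:j])
        i = j
    if buffer:
        runs.append(('bitpack', buffer))
    return runs


# A long run becomes an RLE run; the short runs are bit-packed together.
print(plan_runs([7] * 20 + [1, 2, 3, 1, 2, 3]))
# [('rle', (7, 20)), ('bitpack', [1, 2, 3, 1, 2, 3])]
```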
This addresses the specific case pointed out in #10680. With these changes, I get 2745 bytes for `pl_zstd.pq` and 2962 bytes for `pa_zstd.pq` after running the following code:
```python
from random import randint

import polars as pl

rand_value = randint(1, 10**15)
df = pl.DataFrame({
    'A': [rand_value for _ in range(10_000_000)],
}, schema={
    'A': pl.UInt64,
})
df.write_parquet('pa_zstd.pq', compression='zstd', use_pyarrow=True)
# Match PyArrow default row group size
df.write_parquet('pl_zstd.pq', compression='zstd', use_pyarrow=False, row_group_size=1024**2)
```
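For reference, one (hypothetical) way to verify the sizes and see which encodings ended up in each file is to read the footer metadata with PyArrow; the file names come from the snippet above.

```python
# Hypothetical verification step: compare file sizes and the encodings
# recorded in the Parquet footer for the files written above.
import os
import pyarrow.parquet as pq

for path in ('pl_zstd.pq', 'pa_zstd.pq'):
    column = pq.ParquetFile(path).metadata.row_group(0).column(0)
    print(path, os.path.getsize(path), 'bytes, encodings:', column.encodings)
```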
At first, the additional logic did slow down the encoder significantly. To address this, I did some profiling and made two optimizations.
- The encoder begins by casting the input array to a dictionary array. Currently, the dictionary is created with `u32` keys and downcast to `u16` if possible. I removed this downcasting step because it seemed unnecessary (f7e803a).
- When populating the dictionary array, I noticed that a lot of time is spent in `memcpy` because the key vector is grown one value at a time. I resolved this by reserving the right vector size from the start (14fe89c); see the sketch after this list.
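Conceptually, the dictionary-building step looks something like the Python sketch below. The real code is Rust and operates on Arrow arrays; the preallocated `keys` list here just mirrors the reserve fix.

```python
def to_dictionary(values):
    """Toy sketch of dictionary encoding: unique values go into `dictionary`,
    and each input value is replaced by its key. Preallocating `keys` to the
    input length avoids growing it one element at a time."""
    dictionary = []
    index = {}                    # value -> key
    keys = [0] * len(values)      # preallocated up front
    for i, value in enumerate(values):
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        keys[i] = index[value]
    return dictionary, keys


print(to_dictionary(['a', 'b', 'a', 'a', 'c']))
# (['a', 'b', 'c'], [0, 1, 0, 0, 2])
```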
After these optimizations, the `RLE_DICTIONARY` encoder performs on par with or only slightly worse than `use_pyarrow=True` in most cases (faster than the current encoding performance). However, performance can be up to 50% worse than `use_pyarrow=True` (matching the current encoding performance) with high-cardinality data that frequently switches between RLE and bit-packing:
```python
import itertools

import polars as pl

# Performance degrades with frequent switches between RLE and bit-packing
t = pl.DataFrame({'a': itertools.chain.from_iterable([[i] * 9 + [i+1] * 8 for i in range(100000)])})
```
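A simple (hypothetical) way to put a number on that is to time both writers on the frame above; exact results will of course vary by machine.

```python
# Hypothetical benchmark of the two writers on the pathological frame `t`.
import time

for use_pyarrow in (False, True):
    start = time.perf_counter()
    t.write_parquet('tmp_bench.pq', compression='zstd', use_pyarrow=use_pyarrow)
    print(f'use_pyarrow={use_pyarrow}: {time.perf_counter() - start:.2f}s')
```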
Some other changes:
- Allow nested types to use `RLE_DICTIONARY` encoding (9c73a61). For some reason, dictionary arrays do not scale well for large nested columns. I cannot figure out what is causing this. For example:
```python
nested_tbl = pl.DataFrame({'a': [[0] * 8 + [2] * 9 + [1] * 8 + [3] * 9] * 5000000})
int_tbl = pl.DataFrame({'a': ([0] * 8 + [2] * 9 + [1] * 8 + [3] * 9) * 5000000})
# 12 seconds, most of which is spent casting to dictionary array
nested_tbl.write_parquet('tmp.pq')
# 3 seconds
int_tbl.write_parquet('tmp.pq')
```
- Currently, we use V2 data pages for Parquet. This comes with unclear advantages and one significant disadvantage. Unlike V1 data pages, V2 data pages do not compress definition and repetition levels. This leads to larger file sizes when columns have nested types. Since PyArrow currently defaults to V1 data pages, I believe Polars can safely do the same and reap the file size benefits (d38e1f4).
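As a rough way to see the V1 vs. V2 size difference on a nested column, the same Arrow table can be written twice with PyArrow, which exposes `data_page_version` directly (a sketch; `nested_tbl` is the large frame from the snippet above, so a smaller frame may be more convenient):

```python
# Sketch: compare V1 vs V2 data pages for a nested column via PyArrow.
import os
import pyarrow.parquet as pq

arrow_tbl = nested_tbl.to_arrow()
for version in ('1.0', '2.0'):
    path = f'nested_v{version[0]}.pq'
    pq.write_table(arrow_tbl, path, compression='zstd', data_page_version=version)
    print(path, os.path.getsize(path), 'bytes')
```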
This does not fully resolve the linked issue because it still does not provide users any way of manually specifying encodings like in PyArrow.
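For comparison, manual encoding selection in PyArrow looks roughly like this (a sketch; `column_encoding` needs a reasonably recent pyarrow version and cannot be combined with dictionary encoding on the same columns):

```python
# Sketch: PyArrow's per-column encoding control, which Polars does not expose yet.
import pyarrow.parquet as pq

pq.write_table(
    df.to_arrow(),                 # `df` from the first snippet
    'manual_encoding.pq',
    compression='zstd',
    use_dictionary=False,          # required when choosing encodings manually
    column_encoding={'A': 'DELTA_BINARY_PACKED'},
)
```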
Codecov Report
Attention: Patch coverage is 97.74011% with 4 lines in your changes missing coverage. Please review.
Project coverage is 80.93%. Comparing base (f0dbb6a) to head (b50a82a). Report is 37 commits behind head on main.
:exclamation: Current head b50a82a differs from pull request most recent head 4f06b9e. Consider uploading reports for the commit 4f06b9e to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main #15959 +/- ##
==========================================
- Coverage 81.29% 80.93% -0.36%
==========================================
Files 1381 1385 +4
Lines 176876 178291 +1415
Branches 3043 3050 +7
==========================================
+ Hits 143789 144307 +518
- Misses 32604 33493 +889
- Partials 483 491 +8
| Flag | Coverage Δ | |
|---|---|---|
| python | 74.47% <96.04%> (-0.28%) | :arrow_down: |
| rust | 78.15% <97.74%> (-0.26%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
> Currently, we use V2 data pages for Parquet. This comes with unclear advantages and one significant disadvantage. Unlike V1 data pages, V2 data pages do not compress definition and repetition levels. This leads to larger file sizes when columns have nested types. Since PyArrow currently defaults to V1 data pages, I believe Polars can safely do the same and reap the file size benefits (https://github.com/pola-rs/polars/commit/d38e1f43d2bc419d83ace47088935b906e5e0948).
Hmm.. Do we actually compress those v1 pages then? :thinking:
> Allow nested types to use RLE_DICTIONARY encoding (https://github.com/pola-rs/polars/commit/9c73a61a395d71183917d1db75cf2b08e4c2434f). For some reason, dictionary arrays do not scale well for large nested columns. I cannot figure out what is causing this. For example:
I don't think we should activate it for nested types then.
> Hmm.. Do we actually compress those v1 pages then? 🤔
Yup, here's the relevant code. https://github.com/pola-rs/polars/blob/864e7504fc8b38e82688b43f63681b6cda77fd61/crates/polars-parquet/src/parquet/write/compression.rs#L23-L38
> I don't think we should activate it for nested types then.
That makes sense. I'd like to revisit this later, but I think I'll need to learn some more Rust before tackling that problem.
Thank you for the review!
> That makes sense. I'd like to revisit this later, but I think I'll need to learn some more Rust before tackling that problem.
Ok, I think there is value in this, so I will pick up those points and get this in.
Thank you for the PR @thalassemia
We had to revert this because of https://github.com/pola-rs/polars/issues/16109
I do think these changes were interesting, so if you or anyone else has time to find the cause, that'd be appreciated.
@ritchie46 So sorry for this and thank you @nameexhaustion for fixing it! I should've written a test with more random, realistic data. Since this feature still sounds worthwhile, I'll look into this and include more robust tests for my next PR.