
Polars cannot read DeltaBinaryPacked encoded files

Open Steiniche opened this issue 1 year ago • 3 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

filepath = "data/1.parquet"
df = pl.scan_parquet(filepath, n_rows=200).collect()

Log output

File "/main.py", line 17, in <module>
    .collect()
  File "/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: Decoding Int64 "DeltaBinaryPacked"-encoded required  parquet pages not yet implemented

Issue description

Polars cannot read values encoded with DELTA_BINARY_PACKED, as described here: https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-encoding-delta_binary_packed--5
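For context, DELTA_BINARY_PACKED stores a first value followed by deltas between consecutive values (bit-packed in miniblocks on disk). This is a conceptual sketch of the idea only, not the actual on-disk format from the spec linked above:

```python
# Conceptual sketch of delta encoding (the real format bit-packs the
# deltas in miniblocks; see the Parquet encodings spec).
values = [100, 103, 101, 108]

# "encode": keep the first value plus the consecutive differences
first = values[0]
deltas = [b - a for a, b in zip(values, values[1:])]  # [3, -2, 7]

# "decode": cumulatively re-apply the deltas
decoded = [first]
for d in deltas:
    decoded.append(decoded[-1] + d)

assert decoded == values
```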

Expected behavior

Polars should be able to read Parquet files with Delta Binary Packed encoded columns.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Steiniche avatar Mar 21 '24 16:03 Steiniche

Could you share an example file that contains these?

ritchie46 avatar Mar 21 '24 19:03 ritchie46

The Parquet writers and readers have hardcoded encodings right now. Related to https://github.com/pola-rs/polars/issues/10680#issuecomment-1693058954

I have some example files that demonstrate how some data can be significantly smaller with alternative encodings. I have integral data sampled continuously over the course of about a day (pictured below). The delta binary packed files were about half the size of the plain-encoded files (using zstd compression, the default). The schema in these files was a datetime[µs] column and an i64 column.

[Two screenshots attached (Mar 19 2024): plots of the sampled data]

trueb2 avatar Mar 21 '24 19:03 trueb2

Unfortunately, I cannot share the files we are working on as they contain sensitive information.

I have tried to create a working example showing the problem. However, to my surprise, Polars reads Delta Binary Packed files without issue in the following example:

import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl
import numpy
import random

num_rows = 1_000_000

data = {
    'id': [numpy.int64(random.randint(1, 10_000)) for _ in range(num_rows)],
    'value': [numpy.int64(random.randint(1, 10_000)) for _ in range(num_rows)],
}

table = pa.Table.from_pydict(data)

datafile = "data.parquet"

pq.write_table(table=table, where=datafile, column_encoding="DELTA_BINARY_PACKED", use_dictionary=False)

df = pl.scan_parquet(datafile)
print(df.collect())


My current hypothesis is that a combination of factors must be present to trigger the error. I will keep investigating on my end and see if I can come up with an example that makes Polars throw the error.

Steiniche avatar Mar 25 '24 19:03 Steiniche