polars
polars copied to clipboard
Selecting on a hive-partitioned LazyFrame produces the selected column + the partitioned column
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
df = pl.DataFrame({"col":[1,2,3,4,], "part":[1,1,2,2,]})
df.write_parquet("partitioned-parquet", use_pyarrow=True, pyarrow_options={"partition_cols":["part"]})
ldf = pl.scan_parquet("partitioned-parquet/**/*.parquet")
ldf.select("col").collect()
Log output
No response
Issue description
Using .select
on a LazyFrame that comes from a hive-partitioned parquet results in both the selection results and the hive partitioned column.
Expected behavior
The selection should return only what's computed in the select
context.
In our example, the current (bad) output is
shape: (4, 2)
┌─────┬──────┐
│ col ┆ part │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ 1 │
│ 3 ┆ 2 │
│ 4 ┆ 2 │
└─────┴──────┘
While the expected (good) output is
shape: (4, 1)
┌─────┐
│ col │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
│ 4 │
└─────┘
Note that there is an inconsistency between versions. Specifically, version 0.20.22
spits out the "bad" output, and 0.20.6
spits out the "good" output.
Installed versions
--------Version info---------
Polars: 0.20.22
Index type: UInt32
Platform: Windows-10-10.0.22631-SP0
Python: 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: 0.9.2
matplotlib: 3.8.0
nest_asyncio: <not installed>
numpy: 1.26.3
openpyxl: <not installed>
pandas: 2.2.1
pyarrow: 14.0.2
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
This is a bug in the optimizer. As a workaround, you can use .collect(projection_pushdown=False)
for now.
I put in a bunch of eprintln
statements and did a few experiments. Let's say we have col('a')
as the hive column and then col('b','c')
are in the files.
If I do pl.scan_parquet('./dataset/**/*.parquet').select(pl.col('b'),pl.col('c')).collect()
then we expect to only get b and c but we also get a, as already noted.
If I do pl.scan_parquet('./dataset/**/*.parquet').select(pl.col('b'),pl.col('c'), pl.col('b').alias('g')).collect()
then I get a copy of b as g AND I don't get a
. From all these eprintln statements I notice that in the second case we hit this chunk
https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/projection.rs#L32-L40 with the series composed of b,c, and g. However, the first case, with only b and c in the select, doesn't hit this code chunk.
Separately, I also found that https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/scan/mod.rs#L39-L59
returns the projection
and predicate
that is subsequently used by the reader but it only receives the physical columns, not each expression in the select. To put that a different way, in the second case, it doesn't see the 'g' column. However, the hive_columns that the readers get as projections is from the overall list of hive columns.
There are two potential fixes that I can see.
The first (I'm not sure where to even start with) is to make it so that execute_impl
from projection.rs is always called.
The second idea is to make it so the prepare_scan_args
function (really the materialize_projection) returns a third item which would be the hive columns that are needed and then send that to the readers instead of the full list of available hive columns.
That second idea seems pretty attainable because if I do pl.scan_parquet('./dataset/**/*.parquet').select(pl.col('b'),pl.col('c'),'a', pl.col('b').alias('g'), pl.col('a').alias('z')).collect()
then materialize_projection
sees 'b', 'c', and 'a' and doesn't get the superfluous extras so it just needs to return the pared down hive columns.
To be a little bit more specific, my thoughts are:
- Rename
hive_partitions
here toavailable_hive_parts
https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/scan/parquet.rs#L282-L285
- Have prepare_scan_args return take
available_hive_parts
as itshive_partitions
input and its output would be (projection, predicate, hive_partitions) here
https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/scan/parquet.rs#L287-L293
I think when the readers subsequently get the new pared down hive_partitions
that it'd fix this bug.
@ritchie46 @stinodego thoughts?
Fixed by https://github.com/pola-rs/polars/pull/17152