Selecting on a hive-partitioned LazyFrame produces the selected column + the partitioned column

Open rancomp opened this issue 10 months ago • 1 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.DataFrame({"col":[1,2,3,4,], "part":[1,1,2,2,]})
df.write_parquet("partitioned-parquet", use_pyarrow=True, pyarrow_options={"partition_cols":["part"]})
ldf = pl.scan_parquet("partitioned-parquet/**/*.parquet")
ldf.select("col").collect()

Log output

No response

Issue description

Calling .select on a LazyFrame scanned from a hive-partitioned parquet dataset returns both the selected columns and the hive partition column.

Expected behavior

The selection should return only what's computed in the select context.

In our example, the current (bad) output is

shape: (4, 2)
┌─────┬──────┐
│ col ┆ part │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 1    │
│ 2   ┆ 1    │
│ 3   ┆ 2    │
│ 4   ┆ 2    │
└─────┴──────┘

While the expected (good) output is

shape: (4, 1)
┌─────┐
│ col │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
│ 4   │
└─────┘

Note that the behavior differs between versions: 0.20.22 produces the "bad" output, while 0.20.6 produces the "good" output.

Installed versions

--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             Windows-10-10.0.22631-SP0
Python:               3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.0
nest_asyncio:         <not installed>
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              14.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

rancomp avatar Apr 21 '24 16:04 rancomp

This is a bug in the optimizer. As a workaround, you can use .collect(projection_pushdown=False) for now.

stinodego avatar Apr 22 '24 12:04 stinodego

I put in a bunch of eprintln statements and did a few experiments. Say col('a') is the hive column and col('b') and col('c') are stored in the files.

If I do pl.scan_parquet('./dataset/**/*.parquet').select(pl.col('b'),pl.col('c')).collect() then we expect to only get b and c but we also get a, as already noted.

If I do pl.scan_parquet('./dataset/**/*.parquet').select(pl.col('b'),pl.col('c'), pl.col('b').alias('g')).collect() then I get a copy of b as g AND I don't get a. From all these eprintln statements I can see that the second case hits this chunk

https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/projection.rs#L32-L40

with the series composed of b, c, and g. The first case, with only b and c in the select, never hits this code chunk.

Separately, I also found that

https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/scan/mod.rs#L39-L59

returns the projection and predicate that are subsequently used by the reader, but it only receives the physical columns, not each expression in the select. To put that a different way, in the second case it doesn't see the 'g' column. However, the hive_columns that the readers get as projections come from the overall list of hive columns.

There are two potential fixes that I can see.

The first, which I'm not sure where to even start on, is to make it so that execute_impl from projection.rs is always called.

The second idea is to make it so the prepare_scan_args function (really the materialize_projection) returns a third item which would be the hive columns that are needed and then send that to the readers instead of the full list of available hive columns.

That second idea seems pretty attainable because if I do pl.scan_parquet('./dataset/**/*.parquet').select(pl.col('b'),pl.col('c'),'a', pl.col('b').alias('g'), pl.col('a').alias('z')).collect()

then materialize_projection sees 'b', 'c', and 'a' and doesn't get the superfluous extras so it just needs to return the pared down hive columns.

To be a little bit more specific, my thoughts are:

  1. Rename hive_partitions here to available_hive_parts

https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/scan/parquet.rs#L282-L285

  2. Have prepare_scan_args take available_hive_parts as its hive_partitions input and return (projection, predicate, hive_partitions) here

https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-lazy/src/physical_plan/executors/scan/parquet.rs#L287-L293

I think the bug would be fixed once the readers receive the new pared-down hive_partitions.
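To make the second idea concrete, here is a rough Python stand-in for the logic (the real code is Rust, and this is not necessarily what the eventual fix does). The function name split_projection and its parameters are hypothetical; it only illustrates returning the hive columns that the select actually asks for instead of every available one:

```python
from typing import Optional, Sequence


def split_projection(
    projected: Optional[Sequence[str]],
    file_columns: Sequence[str],
    available_hive_parts: Sequence[str],
) -> tuple[list[str], list[str]]:
    """Hypothetical sketch of the proposed prepare_scan_args behaviour.

    Returns (file_projection, hive_partitions): the columns to read from the
    parquet files and the hive columns the readers should attach.
    """
    if projected is None:
        # No projection pushed down: read everything, attach every hive column.
        return list(file_columns), list(available_hive_parts)

    wanted = set(projected)
    file_projection = [c for c in file_columns if c in wanted]
    # Only keep hive columns that the query actually references.
    hive_partitions = [c for c in available_hive_parts if c in wanted]
    return file_projection, hive_partitions


# First experiment: select b and c only, so no hive column should be attached.
only_files = split_projection(["b", "c"], ["b", "c"], ["a"])

# Later experiment: a is referenced, so it is kept in the pared-down list.
with_hive = split_projection(["b", "c", "a"], ["b", "c"], ["a"])
```

Aliased outputs such as g and z do not appear in the input here because, as described above, the scan only ever sees the physical column names.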

@ritchie46 @stinodego thoughts?

deanm0000 avatar Jun 06 '24 22:06 deanm0000

Fixed by https://github.com/pola-rs/polars/pull/17152

nameexhaustion avatar Jun 26 '24 11:06 nameexhaustion