Panic on `polars.scan_parquet().filter().columns`

Open jmakov opened this issue 1 year ago • 4 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars

lf = polars.scan_parquet(path)  # path to the parquet dataset (not given in the report)
lf.columns  # works
lf.filter(polars.col("col1") > 123).columns  # panics

Log output

thread 'python' panicked at crates/polars-plan/src/logical_plan/optimizer/predicate_pushdown/mod.rs:359:69:
called `Option::unwrap()` on a `None` value

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 _lf.filter(polars.col("ts_event")>123).columns

File ~/mambaforge-pypy3/envs/quantlab/lib/python3.11/site-packages/polars/lazyframe/frame.py:411, in LazyFrame.columns(self)
    394 @property
    395 def columns(self) -> list[str]:
    396     """
    397     Get column names.
    398 
   (...)
    409     ['foo', 'bar']
    410     """
--> 411     return self._ldf.columns()

PanicException: called `Option::unwrap()` on a `None` value

Issue description

Panic on polars.scan_parquet().filter().columns. Also why are you calling unwrap() in production code?

Expected behavior

Printout of columns

Installed versions

Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.6.25-1-MANJARO-x86_64-with-glibc2.39
Python:               3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               24.2.1
hvplot:               0.10.0
matplotlib:           3.7.5
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              14.0.2
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.30
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

jmakov, May 10 '24 07:05

Have you got a repro with a dummy file?

ritchie46, May 10 '24 08:05

No. I tried this, but it works, so I'd have to look a bit deeper into what triggers the problem:

# this works
df = polars.DataFrame({"col1": range(10)})
df.write_parquet("test.parquet")

lf = polars.scan_parquet("test.parquet")
lf.filter(polars.col("col1") > 3).columns

The original parquet dataset is partitioned. And this fails:

# parquet dataset structured as /mnt/data/schema_name/partition1/partition2/year/month.parquet
lf = polars.scan_parquet(os.sep.join(["/mnt/data", 
                                          "some_schema",
                                          f"patition1={var1}", 
                                          f"patition2={var2}",
                                          "*", "*.parquet"]))
lf.filter(polars.col("col1") > 123).columns

jmakov, May 10 '24 08:05
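A dummy dataset that mirrors the partitioned layout described above might be a useful starting point for a repro. The following is only a sketch: the helper names (`partition_glob`, `write_dummy_dataset`, `filtered_columns`), the year directories, and the `01.parquet` file name are hypothetical, and nothing in this thread confirms that such a dataset triggers the panic.

```python
import os


def partition_glob(root, schema, **partitions):
    """Build a glob like root/schema/key=value/.../*/*.parquet."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return os.sep.join([root, schema, *parts, "*", "*.parquet"])


def write_dummy_dataset(root, schema, years=(2023, 2024), **partitions):
    """Write one small parquet file per year directory (hypothetical layout)."""
    import polars  # imported lazily so partition_glob works without polars

    parts = [f"{key}={value}" for key, value in partitions.items()]
    base = os.path.join(root, schema, *parts)
    for year in years:
        year_dir = os.path.join(base, str(year))
        os.makedirs(year_dir, exist_ok=True)
        df = polars.DataFrame({"col1": list(range(200))})
        df.write_parquet(os.path.join(year_dir, "01.parquet"))


def filtered_columns(pattern):
    """Run the operation from the report: scan, filter, then read .columns."""
    import polars

    lf = polars.scan_parquet(pattern)
    return lf.filter(polars.col("col1") > 123).columns
```

To try it, call `write_dummy_dataset(tmp_dir, "some_schema", partition1="a", partition2="b")` on a scratch directory, then pass `partition_glob(tmp_dir, "some_schema", partition1="a", partition2="b")` to `filtered_columns`. On an unaffected version this should simply return the column names.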

@jmakov any chance you could try using polars 0.20.19 and see if your parquet file works with that version? I recently upgraded from that version to 0.20.25 and am now having the same issue you're seeing (using the same parquet files that worked with 0.20.19). Unfortunately, I'm having a hard time creating an MRE.

ATL2001, May 15 '24 01:05

@ATL2001 thanks for the tip. You're right, 0.20.19 works. I also had a hard time investigating and recreating an MRE, and I don't have enough time for that. But at least we now know it's a regression. Thanks!

jmakov, May 15 '24 07:05

Still present in version 0.20.29

jmakov, May 24 '24 10:05

There is a minimal repro here for a different issue:

  • https://github.com/pola-rs/polars/issues/16385#issuecomment-2123606240

But it is also about partitioned datasets, and the same error.

It may be the same underlying problem as described here.

cmdlineluser, May 24 '24 10:05

My repro for this seems to have been fixed on main.

Not 100% sure if this is the case, but I believe this gets fixed by https://github.com/pola-rs/polars/pull/16549 (notably the removal of Default::default() for the hive partition info).

I have integration tests in my code which encountered this exact bug and it seems to have been fixed when I compiled main too.

kszlim, May 28 '24 18:05

> Not 100% sure if this is the case, but I believe this gets fixed by https://github.com/pola-rs/polars/pull/16549 (notably the removal of Default::default() for the hive partition info).

Yes, that's the case.

ritchie46, May 29 '24 08:05

Thanks everyone! I just upgraded to 0.20.31, and the panic is gone! 😀

ATL2001, Jun 05 '24 15:06
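For anyone checking whether their installed version is one of those mentioned in this thread, the reports above can be collected into a small lookup. This is only a summary sketch: the function names are hypothetical, the table holds only the versions commenters actually named, and it assumes plain `X.Y.Z` version strings.

```python
# Outcomes reported by commenters in this thread (not an exhaustive list).
REPORTED = {
    (0, 20, 19): "works",
    (0, 20, 25): "panics",
    (0, 20, 29): "panics",
    (0, 20, 31): "works",
}


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def reported_status(version_text):
    """Return the outcome reported in the thread, or 'not reported'."""
    return REPORTED.get(parse_version(version_text), "not reported")
```

Calling `reported_status(polars.__version__)` gives a quick hint; versions the thread does not mention come back as `"not reported"` rather than being guessed.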