Panic on `polars.scan_parquet().filter().columns`
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars

lf = polars.scan_parquet(...)  # path to the original dataset elided
lf.columns  # works
lf.filter(polars.col("col1") > 123).columns  # panics
Log output
thread 'python' panicked at crates/polars-plan/src/logical_plan/optimizer/predicate_pushdown/mod.rs:359:69:
called `Option::unwrap()` on a `None` value
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[10], line 1
----> 1 _lf.filter(polars.col("ts_event")>123).columns
File ~/mambaforge-pypy3/envs/quantlab/lib/python3.11/site-packages/polars/lazyframe/frame.py:411, in LazyFrame.columns(self)
394 @property
395 def columns(self) -> list[str]:
396 """
397 Get column names.
398
(...)
409 ['foo', 'bar']
410 """
--> 411 return self._ldf.columns()
PanicException: called `Option::unwrap()` on a `None` value
Issue description
Panic on `polars.scan_parquet().filter().columns`. Also, why are you calling `unwrap()` in production code?
Expected behavior
The column names are printed, just as `lf.columns` returns them before the filter is applied.
Installed versions
Polars: 0.20.25
Index type: UInt32
Platform: Linux-6.6.25-1-MANJARO-x86_64-with-glibc2.39
Python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: 0.3.2
deltalake: <not installed>
fastexcel: <not installed>
fsspec: 2024.3.1
gevent: 24.2.1
hvplot: 0.10.0
matplotlib: 3.7.5
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl: <not installed>
pandas: 2.1.4
pyarrow: 14.0.2
pydantic: 2.7.1
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.30
torch: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
Have you got a repro with a dummy file?
No. I tried this, but it works, so I'd have to look a bit deeper into what triggers the problem:
# this works
import polars

df = polars.DataFrame({"col1": range(10)})
df.write_parquet("test.parquet")
lf = polars.scan_parquet("test.parquet")
lf.filter(polars.col("col1") > 3).columns
The original parquet dataset is partitioned, and this fails:
# parquet dataset structured as /mnt/data/schema_name/partition1/partition2/year/month.parquet
import os

import polars

lf = polars.scan_parquet(os.sep.join([
    "/mnt/data",
    "some_schema",
    f"partition1={var1}",  # var1, var2 are defined elsewhere
    f"partition2={var2}",
    "*", "*.parquet",
]))
lf.filter(polars.col("col1") > 123).columns
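For anyone else chasing an MRE, here is a sketch that writes a tiny hive-partitioned dataset by hand and then scans it with the same kind of glob. The base path, partition name, and values are made up for illustration, and whether this minimal layout actually triggers the panic is untested; the real dataset has two partition levels plus year/month subdirectories.

import os

import polars

base = "/tmp/hive_repro"  # hypothetical location
for value in ("a", "b"):
    part_dir = os.path.join(base, f"partition1={value}")
    os.makedirs(part_dir, exist_ok=True)
    # one small file per hive partition directory
    polars.DataFrame({"col1": range(10)}).write_parquet(
        os.path.join(part_dir, "data.parquet")
    )

lf = polars.scan_parquet(os.path.join(base, "*", "*.parquet"))
lf.filter(polars.col("col1") > 3).columns  # panics on affected versions if this layout reproduces it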
@jmakov any chance you could try polars 0.20.19 and see if your parquet file works with that version? I recently upgraded from 0.20.19 to 0.20.25 and am now hitting the same issue you're seeing (with the same parquet files that worked on 0.20.19). Unfortunately, I'm having a hard time creating an MRE.
@ATL2001 thanks for the tip. You're right, 0.20.19 works. I also had a hard time investigating and creating an MRE and don't have enough time for that. But at least we now know it's a regression. Thanks!
Still present in version 0.20.29
There is a minimal repro here for a different issue:
- https://github.com/pola-rs/polars/issues/16385#issuecomment-2123606240
But it is also about partitioned datasets, and the same error.
It may be the same underlying problem as the one described here.
My repro for this seems to have been fixed on main.
Not 100% sure if this is the case, but I believe this gets fixed by https://github.com/pola-rs/polars/pull/16549 (notably the removal of Default::default() for the hive partition info).
I have integration tests in my code that hit this exact bug, and it also seems to be fixed when I compile main.
> Not 100% sure if this is the case, but I believe this gets fixed by https://github.com/pola-rs/polars/pull/16549 (notably the removal of Default::default() for the hive partition info).
Yes, that's the case.
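For anyone stuck on an affected release, a possible workaround (my assumption, based on the hive-partition diagnosis above, not something confirmed in this thread): disable hive partitioning on the scan. The trade-off is that the partition columns are no longer derived from the directory names, so they disappear from the schema.

import polars

# hive_partitioning=False skips parsing partition1=... / partition2=... from the
# directory names, sidestepping the hive partition info path implicated above.
# The glob below is illustrative, matching the layout from the earlier comment.
lf = polars.scan_parquet(
    "/mnt/data/some_schema/*/*/*/*.parquet", hive_partitioning=False
)
lf.filter(polars.col("col1") > 123).columns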
Thanks everyone! I just upgraded to 0.20.31, and the panic is gone! 😀