polars
ColumnNotFoundError after `unnest`
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
A `ColumnNotFoundError` is raised when a LazyFrame query references a column created by `unnest`.
Reproducible example
import polars as pl

pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": range(3)}).with_columns(
    pl.col("a")
    .str.split("|")
    .arr.to_struct(
        name_generator=lambda idx: f"a_{idx}",
    )
).unnest("a").with_columns(
    pl.col("a_1").str.split(","),
).collect()
ColumnNotFoundError Traceback (most recent call last)
Cell In[84], line 9
1 import polars as pl
3 pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": range(3)}).with_columns(
4 pl.col("a")
5 .str.split("|")
6 .arr.to_struct(
7 name_generator=lambda idx: f"a_{idx}",
8 )
----> 9 ).unnest("a").with_columns(pl.col("a_1").str.split(","),).collect()
File ~/apps/polars/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1475, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
1464 common_subplan_elimination = False
1466 ldf = self._ldf.optimization_toggle(
1467 type_coercion,
1468 predicate_pushdown,
(...)
1473 streaming,
1474 )
-> 1475 return wrap_df(ldf.collect())
ColumnNotFoundError: a_1
Error originated just after this operation:
UNNEST by:[a]
WITH_COLUMNS:
[col("a").str.split().arr.to_struct()]
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"
Expected behavior
I expected columns a_0 and a_1 to be available for further manipulation after the unnest.
Installed versions
---Version info---
Polars: 0.16.15
Index type: UInt32
Platform: macOS-13.2.1-arm64-arm-64bit
Python: 3.11.2 (main, Feb 16 2023, 02:51:42) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
numpy: <not installed>
pandas: <not installed>
pyarrow: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
matplotlib: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
The names of the columns are opaque to us because they are generated at runtime in Python. We should add a schema argument so that users can provide one.
Yeah, this is the same as #5220.
I think I understand: because the split depends on data that is loaded at runtime, Polars does not know how many a_X columns there are (if any).
Would a way to specify a schema for the struct work? For example, N columns, a type, and a default value?
Sometimes these will not be known, but often they are. In that case they could be specified up front, and an exception raised when the data does not match them would be valuable information.
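For the case where the number of pieces is known in advance, one option that avoids the runtime-generated names is `str.split_exact`, which returns a struct with a fixed number of fields (named `field_0`, `field_1`, ...), so the lazy planner can see the columns before any data is read. The following is only a sketch under the assumption that every value splits into exactly two parts; it is not the schema argument proposed above, and the `rename` step is just one way to get back the `a_0`/`a_1` names from the example.

import polars as pl

lf = pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": [0, 1, 2]})

result = (
    # One split on "|" gives a struct with two fields: field_0 and field_1.
    lf.with_columns(pl.col("a").str.split_exact("|", 1))
    .unnest("a")
    # Restore the column names used in the original example.
    .rename({"field_0": "a_0", "field_1": "a_1"})
    # The unnested column is now visible to the lazy schema.
    .with_columns(pl.col("a_1").str.split(","))
    .collect()
)
print(result)

Because the field count is part of the expression rather than derived from the data, the later reference to a_1 can be resolved at planning time.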
`to_struct` now has an `upper_bound` parameter that is supposed to fix this, but it doesn't seem to work:
import polars as pl

lf = pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"]})
lf = lf.select(
    pl.col("a")
    .str.split("|")
    .list.to_struct(fields=lambda idx: f"a_{idx}", upper_bound=2)
)
result = lf.unnest("a").select(pl.col("a_1").str.split(",")).collect()
print(result)
# polars.exceptions.ColumnNotFoundError: a_1
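If you know a reasonable upper bound on the number of split parts, you can skip the struct entirely:
import polars as pl

splt = pl.col("a").str.split("|")
upper_bound = 5
# Note: newer Polars versions may need list.get(x, null_on_oob=True)
# to get nulls instead of an error for out-of-bounds indices.
print(
    pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": range(3)}).with_columns(
        splt.list.get(x).alias(f"a_{x}")
        for x in range(upper_bound)
    ).collect()
)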
shape: (3, 7)
┌─────────┬─────┬─────┬─────┬──────┬──────┬──────┐
│ a       ┆ b   ┆ a_0 ┆ a_1 ┆ a_2  ┆ a_3  ┆ a_4  │
│ ---     ┆ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ ---  │
│ str     ┆ i64 ┆ str ┆ str ┆ str  ┆ str  ┆ str  │
╞═════════╪═════╪═════╪═════╪══════╪══════╪══════╡
│ a|b,c   ┆ 0   ┆ a   ┆ b,c ┆ null ┆ null ┆ null │
│ d,e|f   ┆ 1   ┆ d,e ┆ f   ┆ null ┆ null ┆ null │
│ g,h|i,j ┆ 2   ┆ g,h ┆ i,j ┆ null ┆ null ┆ null │
└─────────┴─────┴─────┴─────┴──────┴──────┴──────┘
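The same idea can be carried through to the second split from the original report. This is a minimal sketch that assumes every row has exactly two "|"-separated parts, so both indices are always in bounds; the variable names `parts` and `result` are illustrative only.

import polars as pl

lf = pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": [0, 1, 2]})

# Reusable expression: the list of "|"-separated parts of column "a".
parts = pl.col("a").str.split("|")

result = lf.with_columns(
    parts.list.get(0).alias("a_0"),
    # Split the second part again on "," without going through a struct.
    parts.list.get(1).str.split(",").alias("a_1"),
).collect()
print(result)

Since each output column is given an explicit alias, the lazy schema knows its name up front and no `unnest` is needed.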