
ColumnNotFoundError after `unnest`

Open janrito opened this issue 2 years ago • 3 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

A lazy query raises ColumnNotFoundError when it references a column created by `unnest`.

Reproducible example

import polars as pl

pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": range(3)}).with_columns(
    pl.col("a")
    .str.split("|")
    .arr.to_struct(
        name_generator=lambda idx: f"a_{idx}",
    )
).unnest("a").with_columns(
    pl.col("a_1").str.split(","),
).collect()

ColumnNotFoundError                       Traceback (most recent call last)
Cell In[84], line 9
      1 import polars as pl
      3 pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": range(3)}).with_columns(
      4     pl.col("a")
      5     .str.split("|")
      6     .arr.to_struct(
      7         name_generator=lambda idx: f"a_{idx}",
      8     )
----> 9 ).unnest("a").with_columns(pl.col("a_1").str.split(","),).collect()

File ~/apps/polars/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1475, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1464     common_subplan_elimination = False
   1466 ldf = self._ldf.optimization_toggle(
   1467     type_coercion,
   1468     predicate_pushdown,
   (...)
   1473     streaming,
   1474 )
-> 1475 return wrap_df(ldf.collect())

ColumnNotFoundError: a_1

Error originated just after this operation:
UNNEST by:[a]
   WITH_COLUMNS:
   [col("a").str.split().arr.to_struct()]
    DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"

Expected behavior

I expected columns a_0 and a_1 to be available for further manipulation after the unnest.

Installed versions

---Version info---
Polars: 0.16.15
Index type: UInt32
Platform: macOS-13.2.1-arm64-arm-64bit
Python: 3.11.2 (main, Feb 16 2023, 02:51:42) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
numpy: <not installed>
pandas: <not installed>
pyarrow: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
matplotlib: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>

janrito, Mar 24 '23 18:03

The names of the columns are opaque to us because they are generated at runtime in Python. We should add a schema argument so that users can provide one.

ritchie46, Mar 24 '23 19:03

Yeah, this is the same as #5220.

mcrumiller, Mar 24 '23 20:03

I think I understand: because the split depends on data that is loaded at runtime, Polars does not know how many a_X columns there are (if any).

Would a way of specifying a schema for the struct work? For example, N columns, a type, and a default value?

Sometimes these will not be known, but often they are. In that case they could be specified, and an exception when they do not hold would be valuable information.

janrito, Mar 26 '23 11:03

to_struct now has an upper_bound parameter that is supposed to fix this, but it doesn't seem to work:

import polars as pl

lf = pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"]})
lf = lf.select(
    pl.col("a")
    .str.split("|")
    .list.to_struct(fields=lambda idx: f"a_{idx}", upper_bound=2)
)
result = lf.unnest("a").select(pl.col("a_1").str.split(",")).collect()
print(result)
# polars.exceptions.ColumnNotFoundError: a_1

stinodego, Jan 18 '24 23:01

If you know a reasonable upper bound on the number of parts the split produces, you can skip the struct like this:

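import polars as pl

# Build the split expression once and reuse it for every output column.
splt = pl.col("a").str.split("|")
# Assumed maximum number of parts; in the output shown, positions past the end are null.
upper_bound = 5
print(pl.LazyFrame({"a": ["a|b,c", "d,e|f", "g,h|i,j"], "b": range(3)}).with_columns(
    splt.list.get(x).alias(f"a_{x}")
    for x in range(upper_bound)
).collect())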
shape: (3, 7)
┌─────────┬─────┬─────┬─────┬──────┬──────┬──────┐
│ a       ┆ b   ┆ a_0 ┆ a_1 ┆ a_2  ┆ a_3  ┆ a_4  │
│ ---     ┆ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ ---  │
│ str     ┆ i64 ┆ str ┆ str ┆ str  ┆ str  ┆ str  │
╞═════════╪═════╪═════╪═════╪══════╪══════╪══════╡
│ a|b,c   ┆ 0   ┆ a   ┆ b,c ┆ null ┆ null ┆ null │
│ d,e|f   ┆ 1   ┆ d,e ┆ f   ┆ null ┆ null ┆ null │
│ g,h|i,j ┆ 2   ┆ g,h ┆ i,j ┆ null ┆ null ┆ null │
└─────────┴─────┴─────┴─────┴──────┴──────┴──────┘

deanm0000, Jan 30 '24 13:01