List dtype error with `list.to_array` in LazyFrame

Open Wouittone opened this issue 1 year ago • 5 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

(
    pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg(pl.concat([pl.col("values")]).alias("lists"))
      .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
      .collect()
)

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[79], line 6
      1 (
      2     pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      3       .group_by("id")
      4       .agg(pl.concat([pl.col("values")]).alias("lists"))
      5       .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
----> 6       .collect()
      7 )

File ~/.micromamba/envs/myenv/lib/python3.12/site-packages/polars/lazyframe/frame.py:1934, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1931 if background:
   1932     return InProcessQuery(ldf.collect_concurrently())
-> 1934 return wrap_df(ldf.collect())

ComputeError: expected List dtype

Error originated just after this operation:
AGGREGATE
	[col("values").concat_expr().alias("lists")] BY [col("id")] FROM
  DF ["id", "values"]; PROJECT */2 COLUMNS; SELECTION: "None"

Issue description

This is a strange error: applying .list.to_array in a LazyFrame after a group_by aggregation that concatenates values into a list seems to infer the wrong column type (at least, that is what I understand from the ComputeError: expected List dtype message).

Running the example without the array cast works and the column dtype is a pl.List:

(
    pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg(pl.concat([pl.col("values")]).alias("lists"))
      .select(pl.col("lists").list.gather([0, 1]).alias("lists_start"))
      .collect()
)

# shape: (3, 1)
# ┌─────────────┐
# │ lists_start │
# │ ---         │
# │ list[i64]   │
# ╞═════════════╡
# │ [1, 2]      │
# │ [6, 7]      │
# │ [3, 4]      │
# └─────────────┘

Expected behavior

This seems to work fine for eager DataFrame:

(
    pl.DataFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg(pl.concat([pl.col("values")]).alias("lists"))
      .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
)

And also for a LazyFrame without a group_by:

(
    pl.LazyFrame({"id": [1, 2, 3], "lists": [[1, 2], [3, 4, 5], [6, 7, 8, 9]]})
      .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
      .collect()
)

I was expecting the LazyFrame with a group_by operation to work just like those two.

Installed versions

--------Version info---------
Polars:               0.20.19
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:               3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.1
pydantic:             2.6.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.28
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Wouittone avatar Apr 10 '24 09:04 Wouittone

The key point is that the schema of a FunctionExpr in agg does not seem to have a list type in the schema.

(
    pl.LazyFrame({"a": [1, None], "b": [1, 2]})
      .group_by("b")
      .agg(pl.col("a").drop_nulls())
      .schema
)
# OrderedDict([('b', Int64), ('a', Int64)])

@ritchie46 should have some insight.

reswqa avatar Apr 10 '24 10:04 reswqa

I think function expr should call the to_field with Context::Aggregation. Not sure, but the schema is incorrect.

ritchie46 avatar Apr 12 '24 09:04 ritchie46

I think function expr should call the to_field with Context::Aggregation.

You mean when the function gets its input field? If we had agg(pl.col("a").foo()), the input to foo() would be a list in Context::Aggregation, which also feels problematic.

Maybe FunctionExpr should handle Context in function.get_field (right now it is completely ignored). The only thing bothering me is the return_scalar=True case, which currently propagates to some extent. For instance, agg(pl.col("a").sum().pow(2)) is not nested even though pow itself does not return a scalar, so we can't simply wrap its schema in a list.

reswqa avatar Apr 12 '24 09:04 reswqa

I was about to start working on this a bit further, but I now realize that I probably shouldn't have used a concat in the aggregation step.

I just tried running the example again without it and it works just as expected:

(
    pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg("values")
      .select(pl.col("values").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
      .collect()
)

>>> shape: (3, 1)
┌───────────────┐
│ lists_start   │
│ ---           │
│ array[i64, 2] │
╞═══════════════╡
│ [6, 7]        │
│ [1, 2]        │
│ [3, 4]        │
└───────────────┘

I personally feel that it would be fine to leave it as is, and maybe improve the documentation on this point if you think this is relevant? I'd be happy to give it a shot, if so.

Wouittone avatar Apr 27 '24 00:04 Wouittone

(Sorry for the misclick... :disappointed:)

Wouittone avatar Apr 27 '24 00:04 Wouittone