polars icon indicating copy to clipboard operation
polars copied to clipboard

Add array function to polars expressions

Open corwinjoy opened this issue 1 year ago • 4 comments

Intent: Add a function to create a new array column from a set of input columns (similar to concat_list) as discussed in https://github.com/pola-rs/polars/issues/18090 and https://github.com/pola-rs/polars/issues/8510.

This is a draft PR for discussion and to see what such a function would look like. The exact semantics have not been fully fleshed out.

corwinjoy avatar Oct 03 '24 01:10 corwinjoy

Here are some of the issues and semantics I am uncertain about:

  1. I believe the array function should take an optional DataType in order to create an array of a particular target type. I have done this in an awkward way, and could use help here on how to pass a DataType down to an expression. (It seems like I can use Wrap<DataType> at the top level but not down an the expression level). When a target type is specified, it will try to cast/convert columns to that target type.
  2. If a target type is not specified, right now I use the first column as the target type. Another option could be to use try_get_supertype but this is a very broad upcast and I'm not convinced this function should apply beyond primitive types.
  3. Right now, this function only works on numeric types. Probably String types should be added but I'm not quite sure how to create a String Array since the builders seem to only work on numeric types.
  4. Should this function work on Array and List columns? If so, it should probably flatten lists by one level as done by concat_list and discussed in https://github.com/pola-rs/polars/issues/8510. That is array(array(pl.all())) = array(pl.all()). In other words, in previous discussions, it was mentioned that this function should be idempotent.

I think those are the main questions I had about the design and the other parts are more polishing the code since I am not all that expert in the existing code base. Also, if we are adding array perhaps a new list function should be created at the same time in line with https://github.com/pola-rs/polars/issues/8510#issuecomment-2200865587 ?

corwinjoy avatar Oct 03 '24 01:10 corwinjoy

I guess the other way to write this would be via a call to concat_list. But then, you give up a lot of opportunities for type checking (e.g. you may not want to glue together ragged rows). I think part of the impetus for a new function here would be to have clearer and more explicit behavior when constructing arrays. To be explicit, below is an example of creating an array via concat_list. As mentioned, the problems I see with just reusing concat_list are efficiency and possibly unwanted behavior like wrapping rows.

df = pl.DataFrame(
    [
        pl.Series("f1", [1, 2]),
        pl.Series("f2", [3, None]),
    ],
    schema={
        "f1": pl.Int32,
        "f2": pl.Float64,
    },
)
result = print(df.with_columns(arr=pl.concat_list(pl.all()).list.to_array(2)))


shape: (2, 3)
┌─────┬──────┬───────────────┐
│ f1  ┆ f2   ┆ arr           │
│ --- ┆ ---  ┆ ---           │
│ i32 ┆ f64  ┆ array[f64, 2] │
╞═════╪══════╪═══════════════╡
│ 1   ┆ 3.0  ┆ [1.0, 3.0]    │
│ 2   ┆ null ┆ [2.0, null]   │
└─────┴──────┴───────────────┘

corwinjoy avatar Oct 03 '24 02:10 corwinjoy

Should this function work on Array and List columns?

I think this is a high priority for me.

If so, it should probably flatten lists by one level as done by concat_list

Can you elaborate why you think this is needed? I think we should try to avoid this, and I think if we are careful with our API (ie being more strict with what we accept, so it's not ambiguous how we will interpret it) then this is avoidable. ie the flexible API of array(arg_or_args: IntoExpr | Iterable[IntoExpr], *more_args: IntoExpr) is dangerous, we either need to go full variadic with array(*args: IntoExpr) or full monadic with array(args: Iterable[IntoExpr])

NickCrews avatar Oct 03 '24 06:10 NickCrews

Should this function work on Array and List columns?

I think this is a high priority for me.

If so, it should probably flatten lists by one level as done by concat_list

Can you elaborate why you think this is needed? I think we should try to avoid this, and I think if we are careful with our API (ie being more strict with what we accept, so it's not ambiguous how we will interpret it) then this is avoidable. ie the flexible API of array(arg_or_args: IntoExpr | Iterable[IntoExpr], *more_args: IntoExpr) is dangerous, we either need to go full variadic with array(*args: IntoExpr) or full monadic with array(args: Iterable[IntoExpr])

I think it would be reasonable to give up the iterable syntax, (given that we cannot have empty arrays), and only support the variadic syntax of array(*args: IntoExpr). The reason for choosing this route would be to support glob expansions expressions like pl.all().

corwinjoy avatar Oct 04 '24 21:10 corwinjoy

Closing this for now. I think the spec and how this should behave is too unclear. For now, concat_list(..).list.to_array(..) is workable and I think this can be revised when the Polars team has the chance to firm up what these functions should look like.

corwinjoy avatar Nov 01 '24 22:11 corwinjoy

Thank you @corwinjoy for the effort though! this prior art is going to be invaluable for when someone does get to solving this. Thanks @ritchie46 for the feedback too. If you ever want to spend more time firming up the API/semantics should actually look like, I would be willing to write up a concrete spec/rationale for people to comment on.

NickCrews avatar Nov 01 '24 23:11 NickCrews