datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

select multiple columns in a single `Expr`

Open universalmind303 opened this issue 10 months ago • 22 comments

Is your feature request related to a problem or challenge?

I want to create a udf that can select multiple functions at once, such as a COLUMNS(<regex>) function.

select COLUMNS('number\d+') from my_table.

Currently this is not possible due to the fact that udfs can only ever output a single Expr.

Describe the solution you'd like

Since it would be quite a massive overhaul to refactor all of the planning and udf logic to return Vec<Expr>, I propose adding a new variant to Expr.

Expr::Columns(Vec<Column>).

This seems like the least invasive way to support selecting multiple columns in a single expr.

Describe alternatives you've considered

I'm open for alternatives, but I am not aware of any.

Additional context

Polars has this variant in their Expr

https://github.com/pola-rs/polars/blob/9fec2ecb6d4295969e1d155b386ee82db08745a1/crates/polars-plan/src/dsl/expr.rs#L72

universalmind303 avatar Apr 16 '24 20:04 universalmind303

It looks good to me, and we can gradually deprecate single column in the long-term

jayzhan211 avatar Apr 19 '24 13:04 jayzhan211

@universalmind303 are you already working on this?

edmondop avatar Apr 20 '24 21:04 edmondop

@universalmind303 are you already working on this?

no, feel free to go ahead and work on it if you'd like.

universalmind303 avatar Apr 21 '24 23:04 universalmind303

This would require also modifying the protobuf definitions, are we ok with it?

edmondop avatar Apr 22 '24 00:04 edmondop

How about returning DataType::Struct or DataType::List or DataType::FixedSizeList?

alamb avatar Apr 23 '24 22:04 alamb

My current wip has a struct since it supports different data types and field names, can a List work as well?

edmondop avatar Apr 23 '24 22:04 edmondop

My current wip has a struct since it supports different data types and field names, can a List work as well?

No, you are right, a List can't really support multiple different types

alamb avatar Apr 24 '24 00:04 alamb

@universalmind303 I think it would be interesting to test it, do you have an example / have you written the UDF?

edmondop avatar Apr 24 '24 23:04 edmondop

I wonder if struct(col1, col2) or named_struct('a', col1, 'b', col2) would work for your usecase 🤔

alamb avatar Apr 24 '24 23:04 alamb

> select struct(column1, column2) from abc;
+---------------------------------+
| struct(abc.column1,abc.column2) |
+---------------------------------+
| {c0: 1, c1: 2}                  |
| {c0: 4, c1: 5}                  |
+---------------------------------+
2 row(s) fetched. 
Elapsed 0.004 seconds.

that works, but isn't this unrelated to this change? I suppose it was working before too...

edmondop avatar Apr 25 '24 00:04 edmondop

@alamb by the way, how does the struct udf works? I am not sure I understand how this select only a subset of the fields, does Datafusion drop from the resulting dataset the fields that are not in the return_type ? https://github.com/apache/datafusion/blob/main/datafusion/functions/src/core/struct.rs#L97

edmondop avatar Apr 25 '24 01:04 edmondop

@alamb by the way, how does the struct udf works?

I think https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct has some pretty good examples

alamb avatar Apr 25 '24 15:04 alamb

Apologies I wasn't precise in my question. The code of the struct udf doesn't explicitly drop fields but returns all of them.

How does the selection happens ?

edmondop avatar Apr 25 '24 15:04 edmondop

How does the selection happens ?

I think you can select with something like my_struct_col['foo'] which returns the 'foo' field.

However now re-reading this ticket, I think it would be possible to create a user defined function like COLUMN by Implemening return_type_from_exprs to 🤔

alamb avatar Apr 25 '24 17:04 alamb

How does the selection happens ?

I think you can select with something like my_struct_col['foo'] which returns the 'foo' field.

However now re-reading this ticket, I think it would be possible to create a user defined function like COLUMN by Implemening return_type_from_exprs to 🤔

What I was trying to say is that looking at that UDF isn't clear for me how the other fields are dropped. It seems that Datafusion performs an intersection between the field names returned by the return type and the ones returned by array_struct ?

edmondop avatar Apr 27 '24 14:04 edmondop

FWIW, https://github.com/apache/datafusion/issues/10102 seems related as multi selection could be implemented via struct and unnest.

universalmind303 avatar Apr 29 '24 03:04 universalmind303

How does the selection happens ?

I think you can select with something like my_struct_col['foo'] which returns the 'foo' field. However now re-reading this ticket, I think it would be possible to create a user defined function like COLUMN by Implemening return_type_from_exprs to 🤔

What I was trying to say is that looking at that UDF isn't clear for me how the other fields are dropped. It seems that Datafusion performs an intersection between the field names returned by the return type and the ones returned by array_struct ?

@edmondop By default the column name is named c0, c1 ... which defined in array_struct, so in return_type we also find the associated column with name c0, c1 ...

how the other fields are dropped

I don't think we drop any field in struct.

One of the example can be

statement ok
create table t(a1 int, a2 float) as values(1, 1.2), (2, 2.4);

query IR
select a1, a2 from t;
----
1 1.2
2 2.4

// make this works
query IR
select columns(a1, a2) from t;
----
1 1.2
2 2.4

jayzhan211 avatar May 08 '24 08:05 jayzhan211

I want to create a udf that can select multiple functions at once, such as a COLUMNS() function.

select COLUMNS('number\d+') from my_table.

looking at the struct UDF, it seems that it only receives the columns that are passed to function invocation and doesn't have access to other columns, i.e. in the struct udf, all columns are returned as a struct, which suggests that the engine only pass a subset of the available columns

However, in the case of COLUMNS('number\d+'), you need to have all the columns, and only return few of them from the function. In my understanding neither udf nor udaf or udwf from what I have seen. We would need maybe a fourth type of udf something like udgf which is invoked on all the columns of a logical plan?

edmondop avatar May 16 '24 04:05 edmondop

However, in the case of COLUMNS('number\d+'), you need to have all the columns, and only return few of them from the function

I agree, we can't get all the columns by the current design of function, it is quite challenging than I thought 🤔. We expect to build up a projection plan given the syntax columns(..). The function mentioned here is similar to unnest, it is more like a table function. Therefore, I think we need a user-defined table function that we can define the returned Expr given the function we parsed in the parser.

The difference behaviour in parser (datafusion/sql) between these.

  • UDF/UDAF/UDWF: returns Expr::ScalarFunction / Expr::AggregateFunction / Expr::WindowFunction respectively
  • Unnest: returns Expr::Unnest
  • Columns(regex): returns Expr::Columns(Vec<Column>)
  • UDTF (user-defined table function): returns Expr

To make the mentioned function possible, we don;t even need to introduce Expr::Columns, but returns Vec<Expr> for sql_to_expr.

https://github.com/apache/datafusion/blob/7535d93fa631f0fc42ca1521bc257ca20480b246/datafusion/sql/src/select.rs#L449-L457

jayzhan211 avatar May 16 '24 09:05 jayzhan211

@jayzhan211 that function already returns a Vec<Expr>, or am I missing something? Do we still need to have udtf?

edmondop avatar May 16 '24 23:05 edmondop

@jayzhan211 that function already returns a Vec<Expr>, or am I missing something? Do we still need to have udtf?

sql_to_expr returns Result<Expr>, we can extend it to Result<Vec<Expr>>. Which function are you mentioning?

jayzhan211 avatar May 17 '24 00:05 jayzhan211

I didn't find equivalent behavior in postgres. I'm not sure should we support this kind of returns subset of columns based on column name matching

jayzhan211 avatar May 19 '24 01:05 jayzhan211