datafusion
datafusion copied to clipboard
select multiple columns in a single `Expr`
Is your feature request related to a problem or challenge?
I want to create a udf that can select multiple functions at once, such as a COLUMNS(<regex>)
function.
select COLUMNS('number\d+') from my_table
.
Currently this is not possible due to the fact that udfs can only ever output a single Expr
.
Describe the solution you'd like
Since it would be quite a massive overhaul to refactor all of the planning and udf logic to return Vec<Expr>
, I propose adding a new variant to Expr
.
Expr::Columns(Vec<Column>)
.
This seems like the least invasive way to support selecting multiple columns in a single expr.
Describe alternatives you've considered
I'm open for alternatives, but I am not aware of any.
Additional context
Polars has this variant in their Expr
https://github.com/pola-rs/polars/blob/9fec2ecb6d4295969e1d155b386ee82db08745a1/crates/polars-plan/src/dsl/expr.rs#L72
It looks good to me, and we can gradually deprecate single column in the long-term
@universalmind303 are you already working on this?
@universalmind303 are you already working on this?
no, feel free to go ahead and work on it if you'd like.
This would require also modifying the protobuf definitions, are we ok with it?
How about returning DataType::Struct
or DataType::List
or DataType::FixedSizeList
?
My current wip has a struct since it supports different data types and field names, can a List work as well?
My current wip has a struct since it supports different data types and field names, can a List work as well?
No, you are right, a List can't really support multiple different types
@universalmind303 I think it would be interesting to test it, do you have an example / have you written the UDF?
I wonder if struct(col1, col2)
or named_struct('a', col1, 'b', col2)
would work for your usecase 🤔
> select struct(column1, column2) from abc;
+---------------------------------+
| struct(abc.column1,abc.column2) |
+---------------------------------+
| {c0: 1, c1: 2} |
| {c0: 4, c1: 5} |
+---------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.
that works, but isn't this unrelated to this change? I suppose it was working before too...
@alamb by the way, how does the struct
udf works? I am not sure I understand how this select only a subset of the fields, does Datafusion drop from the resulting dataset the fields that are not in the return_type
? https://github.com/apache/datafusion/blob/main/datafusion/functions/src/core/struct.rs#L97
@alamb by the way, how does the struct udf works?
I think https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct has some pretty good examples
Apologies I wasn't precise in my question. The code of the struct udf doesn't explicitly drop fields but returns all of them.
How does the selection happens ?
How does the selection happens ?
I think you can select with something like my_struct_col['foo']
which returns the 'foo' field.
However now re-reading this ticket, I think it would be possible to create a user defined function like COLUMN
by Implemening return_type_from_exprs
to
🤔
How does the selection happens ?
I think you can select with something like
my_struct_col['foo']
which returns the 'foo' field.However now re-reading this ticket, I think it would be possible to create a user defined function like
COLUMN
by Implemeningreturn_type_from_exprs
to 🤔
What I was trying to say is that looking at that UDF isn't clear for me how the other fields are dropped. It seems that Datafusion performs an intersection between the field names returned by the return type and the ones returned by array_struct ?
FWIW, https://github.com/apache/datafusion/issues/10102 seems related as multi selection could be implemented via struct
and unnest
.
How does the selection happens ?
I think you can select with something like
my_struct_col['foo']
which returns the 'foo' field. However now re-reading this ticket, I think it would be possible to create a user defined function likeCOLUMN
by Implemeningreturn_type_from_exprs
to 🤔What I was trying to say is that looking at that UDF isn't clear for me how the other fields are dropped. It seems that Datafusion performs an intersection between the field names returned by the return type and the ones returned by array_struct ?
@edmondop
By default the column name is named c0, c1 ... which defined in array_struct
, so in return_type
we also find the associated column with name c0, c1 ...
how the other fields are dropped
I don't think we drop any field in struct.
One of the example can be
statement ok
create table t(a1 int, a2 float) as values(1, 1.2), (2, 2.4);
query IR
select a1, a2 from t;
----
1 1.2
2 2.4
// make this works
query IR
select columns(a1, a2) from t;
----
1 1.2
2 2.4
I want to create a udf that can select multiple functions at once, such as a COLUMNS(
select COLUMNS('number\d+') from my_table.
looking at the struct UDF, it seems that it only receives the columns that are passed to function invocation and doesn't have access to other columns, i.e. in the struct
udf, all columns are returned as a struct, which suggests that the engine only pass a subset of the available columns
However, in the case of COLUMNS('number\d+'), you need to have all the columns, and only return few of them from the function. In my understanding neither udf
nor udaf
or udwf
from what I have seen. We would need maybe a fourth type of udf something like udgf
which is invoked on all the columns of a logical plan?
However, in the case of COLUMNS('number\d+'), you need to have all the columns, and only return few of them from the function
I agree, we can't get all the columns by the current design of function, it is quite challenging than I thought 🤔. We expect to build up a projection plan given the syntax columns(..)
. The function mentioned here is similar to unnest
, it is more like a table
function. Therefore, I think we need a user-defined table function that we can define the returned Expr
given the function we parsed in the parser.
The difference behaviour in parser (datafusion/sql) between these.
- UDF/UDAF/UDWF: returns
Expr::ScalarFunction
/Expr::AggregateFunction
/Expr::WindowFunction
respectively - Unnest: returns
Expr::Unnest
-
Columns(regex)
: returnsExpr::Columns(Vec<Column>)
- UDTF (user-defined table function): returns
Expr
To make the mentioned function possible, we don;t even need to introduce Expr::Columns
, but returns Vec<Expr>
for sql_to_expr
.
https://github.com/apache/datafusion/blob/7535d93fa631f0fc42ca1521bc257ca20480b246/datafusion/sql/src/select.rs#L449-L457
@jayzhan211 that function already returns a Vec<Expr>
, or am I missing something? Do we still need to have udtf?
@jayzhan211 that function already returns a Vec<Expr>, or am I missing something? Do we still need to have udtf?
sql_to_expr
returns Result<Expr>
, we can extend it to Result<Vec<Expr>>
. Which function are you mentioning?
I didn't find equivalent behavior in postgres. I'm not sure should we support this kind of returns subset of columns based on column name matching