datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Support arbitrary expressions in `LIMIT` clause

Open alamb opened this issue 1 year ago • 5 comments

Follow on to https://github.com/apache/arrow-datafusion/issues/9506

The idea is to support arbitrary expressions that can be consolidated to a constant in the LIMIT clause. For example

❯ select * from (values (1)) LIMIT 10*100;
Error during planning: Unsupported operator for LIMIT clause

This query should be able to run (and return the single value)

❯ select * from (values (1)) LIMIT 10+100;
+---------+
| column1 |
+---------+
| 1       |
+---------+

https://github.com/apache/arrow-datafusion/pull/9790 adds support for basic +/- but the general purpose solution that would handle any expr that can be consolidated to a constant would be better

As suggested by @jonahgao this might look like change the Limit logical plan to support arbitrary expressions?

pub struct Limit {
    pub skip: Expr,
    pub fetch: Option<Expr>,
    pub input: Arc<LogicalPlan>,
}

The SimplifyExpressions rule can automatically optimize them into constants. Some optimization rules such as PushDownLimit only run when the limit expression is a constant. We may need to add a cast for the limit expression when planning, only checking if it is a constant of type u64.

When creating the LimitExec physical plan, convert the limit expression into PhysicalExpr and evaluate it.

Originally posted by @jonahgao in https://github.com/apache/arrow-datafusion/pull/9790#discussion_r1539358701

alamb avatar Mar 27 '24 13:03 alamb

I can do this one since I have been doing a related limit pushdown feature.

Lordworms avatar Mar 27 '24 13:03 Lordworms

wait until https://github.com/apache/arrow-datafusion/pull/9815#issue-2209595815 merged

Lordworms avatar Mar 31 '24 15:03 Lordworms

This one is probably ready to work on now

alamb avatar Oct 15 '24 10:10 alamb

While re-reading this I think we should start with implementhing limits that can be evaluated to a constant by the time the physical plan is created (aka don't change the physical execution plans)

It is not clear to me what LIMIT 100 + x means

The key usecases are:

  • LIMIT $1 -- parameters like https://github.com/apache/datafusion/issues/12294
  • LIMIT 8 * 1024 -- expressions that evaluate to constants

alamb avatar Oct 15 '24 10:10 alamb

While re-reading this I think we should start with implementhing limits that can be evaluated to a constant by the time the physical plan is created (aka don't change the physical execution plans)

I will switch my focus to working on it.

jonahgao avatar Oct 15 '24 14:10 jonahgao