datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Support limit in StreamingTableExec

Open lewiszlw opened this issue 9 months ago • 9 comments

Which issue does this PR close?

Closes todo.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

lewiszlw avatar Apr 30 '24 11:04 lewiszlw

Seems due to filter, limit won't be pushed down to table provider. So I added a simple test in limit.slt .

lewiszlw avatar May 01 '24 08:05 lewiszlw

FYI @metesynnada

alamb avatar May 01 '24 11:05 alamb

@alamb, this looks good to me. However, I'm concerned there might be redundant LimitStream overhead when both GlobalLimitExec and StreamTableExec are present in the plan, as LimitStream appears to be used twice.

Additionally, it would be better to use unit tests addition to slt tests, given that GlobalLimitExec already ensures the tests function properly even if StreamTableExec fails.

metesynnada avatar May 01 '24 12:05 metesynnada

@alamb, this looks good to me. However, I'm concerned there might be redundant LimitStream overhead when both GlobalLimitExec and StreamTableExec are present in the plan, as LimitStream appears to be used twice.

Additionally, it would be better to use unit tests addition to slt tests, given that GlobalLimitExec already ensures the tests function properly even if StreamTableExec fails.

Those are excellent points. @lewiszlw are you able to add the unit tests? Otherwise I can try and find some time to do so

alamb avatar May 01 '24 13:05 alamb

Hmm, I don't get the point. How should I add unit tests?

lewiszlw avatar May 01 '24 14:05 lewiszlw

Hmm, I don't get the point. How should I add unit tests?

The executor should be capable of running independently within a test. For more insight, refer to the hash join tests in datafusion/physical-plan/src/joins/hash_join.rs. Your tests ofc will be much more simple.

Currently, a GlobalLimitExec is already implemented at the top of the existing tests, which means these tests wouldn't fail if the limit implementation in StreamingTableExec were to change.

metesynnada avatar May 02 '24 07:05 metesynnada

I think the idea would be to create a StreamingTableExec directly and then read data from it, ensuring that the limit was applied

I didn't see any existing unit tests for StreamingTableExec so we would have to write some I think

https://github.com/apache/datafusion/blob/fa31c781d2cb0bdfba06dcc07bc75d9a5f9686b2/datafusion/physical-plan/src/streaming.rs#L221

Maybe I missed it

alamb avatar May 02 '24 12:05 alamb

If you don't have time to do this @lewiszlw I can try and help, or perhaps we can file a ticket and do it as a follow on PR

alamb avatar May 02 '24 12:05 alamb

If you don't have time to do this @lewiszlw I can try and help, or perhaps we can file a ticket and do it as a follow on PR

Yeah, I'm on vacation. So might not action in time.

lewiszlw avatar May 02 '24 13:05 lewiszlw

If you don't have time to do this @lewiszlw I can try and help, or perhaps we can file a ticket and do it as a follow on PR

Yeah, I'm on vacation. So might not action in time.

No problem -- I am giving it a shot

alamb avatar May 03 '24 16:05 alamb

I pushed a unit test in 30bf3bf

alamb avatar May 03 '24 16:05 alamb

Thanks again @lewiszlw and @metesynnada

alamb avatar May 06 '24 16:05 alamb