Daft icon indicating copy to clipboard operation
Daft copied to clipboard

The display of pushdown in explain regarding limit and filter is incorrect.

Open Jay-ju opened this issue 6 months ago • 11 comments

Describe the bug

Question 1: The limit was not pushed down to the scan operator, but the limit expression was still shown in the physical plan.Image

Question 2: The datasource does not support filter pushdown, but the pushdown still includes filter pushdown.

To Reproduce

No response

Expected behavior

No response

Component(s)

Expressions

Additional context

No response

Jay-ju avatar Jun 25 '25 12:06 Jay-ju

Question 1: The limit was not pushed down to the scan operator, but the limit expression was still shown in the physical plan.

  • This is actually correct. The limit expression in the physical plan is a global limit, whereas the limits in the scan operator are local limits. Imagine we did limit(1) and read 2 csv files with 1 row each. We push the limit(1) down into each file correctly, and they both return 1 row. If we didn't have the global limit, we will end up returning 2 rows. The global limit in the physical plan is therefore needed to apply the limit across all morsels / partitions of data read from all the files.

Question 2: The datasource does not support filter pushdown, but the pushdown still includes filter pushdown.

  • Looks like there's work here to support filter pushdowns in lance: https://github.com/Eventual-Inc/Daft/discussions/4615

colin-ho avatar Jun 25 '25 23:06 colin-ho

Question 1: The limit was not pushed down to the scan operator, but the limit expression was still shown in the physical plan.

  • This is actually correct. The limit expression in the physical plan is a global limit, whereas the limits in the scan operator are local limits. Imagine we did limit(1) and read 2 csv files with 1 row each. We push the limit(1) down into each file correctly, and they both return 1 row. If we didn't have the global limit, we will end up returning 2 rows. The global limit in the physical plan is therefore needed to apply the limit across all morsels / partitions of data read from all the files.

Question 2: The datasource does not support filter pushdown, but the pushdown still includes filter pushdown.

@colin-ho Get it. That is, in the datasource, there is a partial limit, which means the datasource supports limit. However, the current datasource in lance does not support limit. So what I added here is that it is necessary to display whether there should be a limit in the datasource according to whether the datasource supports it.

Jay-ju avatar Jun 26 '25 01:06 Jay-ju

Question 1: The limit was not pushed down to the scan operator, but the limit expression was still shown in the physical plan.

  • This is actually correct. The limit expression in the physical plan is a global limit, whereas the limits in the scan operator are local limits. Imagine we did limit(1) and read 2 csv files with 1 row each. We push the limit(1) down into each file correctly, and they both return 1 row. If we didn't have the global limit, we will end up returning 2 rows. The global limit in the physical plan is therefore needed to apply the limit across all morsels / partitions of data read from all the files.

Question 2: The datasource does not support filter pushdown, but the pushdown still includes filter pushdown.

What I want to express here is not that the global limit is not needed. Instead, if the datasource does not support limit pushdown, I think the display of limit should not be shown in the datasource scan plan. This would be considered that limit is already supported in pushdown, but this is not the case in reality.

Jay-ju avatar Jun 26 '25 08:06 Jay-ju

@Jay-ju We've found that it can be useful to see that some sort of limit pushdown is done on the source, even if its not the full one. It's good to verify if an optimization has occurred or not, which is one of the reasons to have a physical plan. However, I do admit that its confusing, so maybe we could rename it from Pushdowns { projection: [vector], limit: 1 } to Pushdowns { projection: [vector], local_limit: 1 } or global_limit. That way, its clearer. Wdyt?

srilman avatar Jun 27 '25 01:06 srilman

@srilman I want to confirm one thing, which is the meaning of the expression "local_limit". For example, when I do read_lance, each fragment in the scannode is a task. As I understand it currently, there is no pushed - down limit for each fragment, that is, each fragment is read completely. Here, the meaning of "local_limit" is that all fragments are scanned, but when the scannode outputs, each task only outputs the number of rows specified by the limit. If this is the case, my suggestion is that the limit should not be written in the pushdown, but there should be a separate partial limit. And downstream of the scannode is a global limit.

Jay-ju avatar Jun 27 '25 02:06 Jay-ju

https://github.com/Eventual-Inc/Daft/pull/4612/files @srilman @colin-ho Similar to the implementation in this file, if the datasource itself does not support pushdown, the pushdown will not display the filter. Instead, a global filter will be applied externally. Should the limit be similar?

Jay-ju avatar Jun 27 '25 02:06 Jay-ju

@Jay-ju I believe the current lance datasource does support a partial limit, so each fragment / task reads the limit number of rows. If there are 100 tasks, then we read 100 * limit rows before performing the global limit.

srilman avatar Jun 27 '25 02:06 srilman

@srilman ImageCurrently, lance does not support partial limit. I have implemented it here.https://github.com/Eventual-Inc/Daft/pull/4622/files

Jay-ju avatar Jun 27 '25 02:06 Jay-ju

Image

Jay-ju avatar Jun 27 '25 02:06 Jay-ju

@srilman Could you please take a look at the discussion here again? Because there are some historical unit tests (UTs) in https://github.com/Eventual-Inc/Daft/pull/4612. If possible, I need to change the expectations of the UTs here. If you think the current behavior should be retained, I will prioritize the display of filters in pushdown. Ford that cannot be pushed down will not be shown in pushdown for now.

Jay-ju avatar Jun 28 '25 03:06 Jay-ju

@srilman any updates here? Thanks

rchowell avatar Aug 05 '25 23:08 rchowell