The pushdown display in explain for limit and filter is incorrect.
Describe the bug
Question 1: The limit was not pushed down to the scan operator, but the limit expression was still shown in the physical plan.
Question 2: The datasource does not support filter pushdown, but the pushdowns shown in the plan still include a filter.
To Reproduce
No response
Expected behavior
No response
Component(s)
Expressions
Additional context
No response
Question 1: The limit was not pushed down to the scan operator, but the limit expression was still shown in the physical plan.
- This is actually correct. The limit expression in the physical plan is a global limit, whereas the limits in the scan operator are local limits. Imagine we did limit(1) and read 2 CSV files with 1 row each. We push the limit(1) down into each file correctly, and they both return 1 row. If we didn't have the global limit, we would end up returning 2 rows. The global limit in the physical plan is therefore needed to apply the limit across all morsels / partitions of data read from all the files.
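For concreteness, here is a minimal sketch of that two-file scenario, assuming the standard daft Python API (read_csv, limit, explain, to_pydict); the comments describe the expected plan shape rather than the literal explain output.

```python
import os
import tempfile

import daft

# Two single-row CSV files, mirroring the example above.
tmp = tempfile.mkdtemp()
paths = []
for i in range(2):
    path = os.path.join(tmp, f"part{i}.csv")
    with open(path, "w") as f:
        f.write("x\n")      # header
        f.write(f"{i}\n")   # one data row per file
    paths.append(path)

df = daft.read_csv(paths).limit(1)

# Each scan task gets a pushed-down (local) limit, so each file returns at
# most 1 row, i.e. up to 2 rows in total. The separate Limit node in the
# physical plan then trims the combined output down to the requested 1 row.
df.explain(show_all=True)
assert len(df.to_pydict()["x"]) == 1
```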
Question 2: The datasource does not support filter pushdown, but the pushdowns shown in the plan still include a filter.
- Looks like there's work here to support filter pushdowns in lance: https://github.com/Eventual-Inc/Daft/discussions/4615
@colin-ho Got it. So the scan operator carries a partial (local) limit, which implies the datasource supports limit pushdown. However, the current lance datasource does not support limit pushdown. My point is that whether a limit is displayed in the scan's pushdowns should depend on whether the datasource actually supports it.
What I want to express here is not that the global limit is unnecessary. Rather, if the datasource does not support limit pushdown, the limit should not be displayed in the datasource scan plan. Showing it suggests that limit pushdown is already supported, which is not the case in reality.
@Jay-ju We've found that it can be useful to see that some sort of limit pushdown is done on the source, even if it's not the full one. It's good to verify whether an optimization has occurred, which is one of the reasons to have a physical plan. However, I do admit that it's confusing, so maybe we could rename it from Pushdowns { projection: [vector], limit: 1 } to Pushdowns { projection: [vector], local_limit: 1 } or global_limit. That way, it's clearer. Wdyt?
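To make the proposed rename concrete, here is a hypothetical, display-only sketch of what the scan's pushdown line could look like; PushdownsDisplay is purely illustrative and is not Daft's actual Pushdowns type.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PushdownsDisplay:
    """Hypothetical display-only model of a scan's pushdowns (not Daft's real type)."""

    projection: list[str] | None = None
    # Renamed from `limit` to signal that it only bounds each scan task / morsel,
    # not the overall result; the global limit still lives in the physical plan.
    local_limit: int | None = None
    filters: list[str] = field(default_factory=list)

    def __repr__(self) -> str:
        parts = []
        if self.projection is not None:
            parts.append(f"projection: [{', '.join(self.projection)}]")
        if self.local_limit is not None:
            parts.append(f"local_limit: {self.local_limit}")
        if self.filters:
            parts.append(f"filters: [{', '.join(self.filters)}]")
        return "Pushdowns { " + ", ".join(parts) + " }"


print(PushdownsDisplay(projection=["vector"], local_limit=1))
# Pushdowns { projection: [vector], local_limit: 1 }
```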
@srilman I want to confirm one thing: the meaning of "local_limit". For example, when I do read_lance, each fragment in the scan node is a task. As I currently understand it, there is no pushed-down limit for each fragment, i.e. each fragment is read completely. In that case, "local_limit" would mean that all fragments are fully scanned, but when the scan node outputs, each task only emits the number of rows specified by the limit. If so, my suggestion is that the limit should not be written in the pushdowns; instead there should be a separate partial limit, with a global limit downstream of the scan node.
https://github.com/Eventual-Inc/Daft/pull/4612/files @srilman @colin-ho Similar to the implementation in this file: if the datasource itself does not support the pushdown, the filter is not displayed in the pushdowns; instead, a global filter is applied externally. Should the limit behave the same way?
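A hypothetical sketch of the planning behavior being proposed here, under the assumption that the planner knows each source's capabilities; the names SourceCapabilities, supports_filter_pushdown / supports_limit_pushdown, and split_pushdowns are illustrative, not Daft's actual internals.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SourceCapabilities:
    supports_filter_pushdown: bool
    supports_limit_pushdown: bool


def split_pushdowns(caps: SourceCapabilities, filter_expr: str | None, limit: int | None):
    """Return (pushdowns displayed on the scan, operators kept above the scan)."""
    scan_pushdowns: dict[str, object] = {}
    above_scan: list[str] = []

    if filter_expr is not None:
        if caps.supports_filter_pushdown:
            scan_pushdowns["filter"] = filter_expr
        else:
            # Unsupported filter pushdown: keep it out of the scan's pushdowns
            # and apply it as a separate Filter operator instead.
            above_scan.append(f"Filter({filter_expr})")

    if limit is not None:
        if caps.supports_limit_pushdown:
            scan_pushdowns["local_limit"] = limit
        # A global limit is always kept above the scan, because each scan task
        # only bounds its own fragment / morsel of data.
        above_scan.append(f"Limit({limit})")

    return scan_pushdowns, above_scan


# A source like the current lance reader (no filter or limit pushdown) would
# show empty pushdowns on the scan, with Filter and Limit applied above it.
print(split_pushdowns(SourceCapabilities(False, False), "x > 1", 1))
```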
@Jay-ju I believe the current lance datasource does support a partial limit, so each fragment / task reads the limit number of rows. If there are 100 tasks, then we read 100 * limit rows before performing the global limit.
@srilman Currently, lance does not support a partial limit. I have implemented it here: https://github.com/Eventual-Inc/Daft/pull/4622/files
@srilman Could you please take a look at the discussion here again? There are some historical unit tests (UTs) in https://github.com/Eventual-Inc/Daft/pull/4612, and if possible I need to change the expectations of those UTs. If you think the current behavior should be retained, I will prioritize the display of filters in pushdowns; filters that cannot be pushed down will not be shown in the pushdowns for now.
@srilman any updates here? Thanks