datafusion
datafusion copied to clipboard
Some In/Exists Subqueries will generate wrong PhysicalPlan
Describe the bug
Datafusion can't execute non-correlated subquery now, like in/exists.
So these queries should return NotImplemented("Physical plan does not support logical expression In/Exists error. But currently the filter will be pushdown to the TableScan.
> explain select * from t1 where exists(select 1 from t2 where t2.t2_id > 0);
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: t1.t1_id, t1.t1_name, t1.t1_int |
| | TableScan: t1 projection=[t1_id, t1_name, t1_int], full_filters=[EXISTS (<subquery>)] |
| | Subquery: |
| | Projection: Int64(1) |
| | Filter: CAST(t2.t2_id AS Int64) > Int64(0) |
| | TableScan: t2 |
| physical_plan | ProjectionExec: expr=[t1_id@0 as t1_id, t1_name@1 as t1_name, t1_int@2 as t1_int] |
| | CsvExec: files={1 group: [[home/work/tools/datafusion-test-data/join-context/t1.csv]]}, has_header=false, limit=None, projection=[t1_id, t1_name, t1_int] |
| | |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
For the above query, the subquery filter will be lost in the final physical_plan.
To Reproduce As above.
Expected behavior
The query should return NotImplemented("Physical plan does not support logical expression In/Exists error.
Additional context Add any other context about the problem here.
I remember the filters will be added to wrong side. I can work on this and provide a fix.
Here is a full reproducer:
echo "1,2" > data.csv
The run sql
> create external table t1(t1_id int, t2_name int) stored as csv location 'data.csv';
0 row(s) fetched.
Elapsed 0.019 seconds.
> create external table t2(t2_id int, t2_name int) stored as csv location 'data.csv';
0 row(s) fetched.
Elapsed 0.007 seconds.
> explain select * from t1 where exists(select 1 from t2 where t2.t2_id > 0);
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | TableScan: t1 projection=[t1_id, t2_name], full_filters=[EXISTS (<subquery>)] |
| | Subquery: |
| | Projection: Int64(1) |
| | Filter: CAST(t2.t2_id AS Int64) > Int64(0) |
| | TableScan: t2 |
| physical_plan | CsvExec: file_groups={1 group: [[Users/andrewlamb/Software/datafusion2/datafusion-cli/data.csv]]}, projection=[t1_id, t2_name], has_header=true |
| | |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.019 seconds.
take