datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Some In/Exists Subqueries will generate wrong PhysicalPlan

Open ygf11 opened this issue 2 years ago • 3 comments

Describe the bug

Datafusion can't execute non-correlated subquery now, like in/exists. So these queries should return NotImplemented("Physical plan does not support logical expression In/Exists error. But currently the filter will be pushdown to the TableScan.

> explain select * from t1 where exists(select 1 from t2 where t2.t2_id > 0);
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                        |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.t1_id, t1.t1_name, t1.t1_int                                                                                                                 |
|               |   TableScan: t1 projection=[t1_id, t1_name, t1_int], full_filters=[EXISTS (<subquery>)]                                                                     |
|               |     Subquery:                                                                                                                                               |
|               |       Projection: Int64(1)                                                                                                                                  |
|               |         Filter: CAST(t2.t2_id AS Int64) > Int64(0)                                                                                                          |
|               |           TableScan: t2                                                                                                                                     |
| physical_plan | ProjectionExec: expr=[t1_id@0 as t1_id, t1_name@1 as t1_name, t1_int@2 as t1_int]                                                                           |
|               |   CsvExec: files={1 group: [[home/work/tools/datafusion-test-data/join-context/t1.csv]]}, has_header=false, limit=None, projection=[t1_id, t1_name, t1_int] |
|               |                                                                                                                                                             |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+

For the above query, the subquery filter will be lost in the final physical_plan.

To Reproduce As above.

Expected behavior The query should return NotImplemented("Physical plan does not support logical expression In/Exists error.

Additional context Add any other context about the problem here.

ygf11 avatar Feb 13 '23 12:02 ygf11

I remember the filters will be added to wrong side. I can work on this and provide a fix.

mingmwang avatar Feb 16 '23 05:02 mingmwang

Here is a full reproducer:

echo "1,2" > data.csv

The run sql


> create external table t1(t1_id int, t2_name int) stored as csv location 'data.csv';
0 row(s) fetched.
Elapsed 0.019 seconds.

> create external table t2(t2_id int, t2_name int) stored as csv location 'data.csv';
0 row(s) fetched.
Elapsed 0.007 seconds.

> explain select * from t1 where exists(select 1 from t2 where t2.t2_id > 0);
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                            |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | TableScan: t1 projection=[t1_id, t2_name], full_filters=[EXISTS (<subquery>)]                                                                   |
|               |   Subquery:                                                                                                                                     |
|               |     Projection: Int64(1)                                                                                                                        |
|               |       Filter: CAST(t2.t2_id AS Int64) > Int64(0)                                                                                                |
|               |         TableScan: t2                                                                                                                           |
| physical_plan | CsvExec: file_groups={1 group: [[Users/andrewlamb/Software/datafusion2/datafusion-cli/data.csv]]}, projection=[t1_id, t2_name], has_header=true |
|               |                                                                                                                                                 |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.019 seconds.

alamb avatar Oct 18 '24 20:10 alamb

take

Lordworms avatar Oct 18 '24 23:10 Lordworms