polars
polars copied to clipboard
SQL cross join incorrect filter plan
Checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
df = pl.DataFrame({
"id":['1','2','3']
})
df2 = pl.DataFrame({
"id":['4','5','6']
})
ctx = pl.SQLContext(df=df, df2=df2)
ctx.execute(
"""
SELECT *
FROM df
CROSS JOIN df2
WHERE df.id = df2.id
""",
eager=False,
)
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
SELECT [col("id"), col("id_right")] FROM
FILTER [(col("id")) == (col("id"))] FROM
CROSS JOIN:
LEFT PLAN ON: []
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: []
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
END CROSS JOIN
Log output
No response
Issue description
The filter plan is incorrect on the right-hand side of the join. In the select you can see the same columns got renamed to _right, but in the filter it doesn't properly recognize df2.id as df2.id_right. When I pass id_right it works.
Expected behavior
Pass the correct filter down in the plan.
Installed versions
--------Version info---------
Polars: 0.19.10
Index type: UInt32
Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.10.12 (main, Aug 9 2023, 14:47:34) [GCC 9.4.0]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: 2.2.1
connectorx: <not installed>
deltalake: 0.12.0
fsspec: 2023.9.2
gevent: <not installed>
matplotlib: <not installed>
numpy: 1.23.5
openpyxl: <not installed>
pandas: 1.5.3
pyarrow: 11.0.0
pydantic: 1.10.13
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.22
xlsx2csv: 0.8.1
xlsxwriter: 3.1.9
Is the output incorrect? Can you reproduce with the polars dataframe API?
Could you show me an example with the incorrect output.
@ritchie46
SQL output (wrong):
shape: (9, 2)| id | id_right |
|---|---|
| str | str |
| "1" | "4" |
| "1" | "5" |
| "1" | "6" |
| "2" | "4" |
| "2" | "5" |
| "2" | "6" |
| "3" | "4" |
| "3" | "5" |
| "3" | "6" |
DataFrame API output (correct)
df.join(df2, how="cross").filter(pl.col("id") == pl.col("id_right"))
| id | id_right |
|---|---|
| str | str |