ibis
ibis copied to clipboard
bug: joins use same table reference if two tables are structurally the same
What happened?
import ibis
from ibis import _
t = ibis.memtable({"x": [1, 2, 3]})
t2 = t.mutate(x=_.x + 1)
t3 = t.mutate(x=_.x + 1)
t4 = t.mutate(x=_.x + 2)
print(id(t2), id(t3), id(t4))
# all three are distinct tables
# even though t2 and t3 are structurally the same
# 11073092688 11095764944 11130247120
j1 = ibis.join(t2, t3, [(t2.x, t3.x)])
j2 = ibis.join(t2, t4, [(t2.x, t4.x)])
ibis.to_sql(j1)
ibis.to_sql(j2)
j1's sql. Note the ON t0.x = t0.x join condition. The rest of the expression could stay the same, as long as that gets fixed to ON t0.x = t0.x. Currently this does a NOOP filter, and then a cross product.
WITH t0 AS (
SELECT
t2.x + CAST(1 AS TINYINT) AS x
FROM ibis_pandas_memtable_wxcdeev74re7vns2b777suwche AS t2
)
SELECT
t0.x,
t1.x AS x_right
FROM t0
JOIN t0 AS t1
ON t0.x = t0.x
j2's sql. Note that two different CTEs are getting generated, so there is no opportunity for mangling to happen.
WITH t1 AS (
SELECT
t2.x + CAST(1 AS TINYINT) AS x
FROM ibis_pandas_memtable_wxcdeev74re7vns2b777suwche AS t2
), t0 AS (
SELECT
t2.x + CAST(2 AS TINYINT) AS x
FROM ibis_pandas_memtable_wxcdeev74re7vns2b777suwche AS t2
)
SELECT
t1.x
FROM t1
JOIN t0
ON t1.x = t0.x
What I would expect:
as long as id(t1) != id(t2), then I would expect that ibis.join(t1, t2) would lead to a join with a condition like ON t0.x = t1.x
I can of course get around this by calling .view() on one of them. But my understanding is that this should only be needed if id(t1) == id(t2).
This makes me wonder, can we get around this just by adding in def join()
if id(right) != id(left):
left = left.view()
Would this ever do something that a user wouldn't want?
What version of ibis are you using?
8.0.0
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hey @NickCrews -- the-epic-split includes a complete rewrite and rethinking of our join operations. This is the output on that branch of j1 and j2:
[ins] In [2]: ibis.to_sql(j1)
Out[2]:
WITH "t1" AS (
SELECT
"t0"."x" + CAST(1 AS TINYINT) AS "x"
FROM "ibis_pandas_memtable_tsatyzmphras3lzufddbppojeu" AS "t0"
)
SELECT
"t3"."x"
FROM "t1" AS "t3"
INNER JOIN "t1" AS "t4"
ON "t3"."x" = "t4"."x"
[ins] In [3]: ibis.to_sql(j2)
Out[3]:
SELECT
"t3"."x"
FROM (
SELECT
"t0"."x" + CAST(1 AS TINYINT) AS "x"
FROM "ibis_pandas_memtable_tsatyzmphras3lzufddbppojeu" AS "t0"
) AS "t3"
INNER JOIN (
SELECT
"t0"."x" + CAST(2 AS TINYINT) AS "x"
FROM "ibis_pandas_memtable_tsatyzmphras3lzufddbppojeu" AS "t0"
) AS "t4"
ON "t3"."x" = "t4"."x"
Nice! Ok then sounds like this is already fixed. Was there a test added for this in that branch?
I'm not sure whether there's a test for this specific case -- might have to spelunk a bit.
we did add a LOT of tests for different kinds of join chaining, so probably, but I can't confirm that with full confidence.
We have an extensive set of tests for this behavior and many other tricky join issues in test_newrels.py.
@kszucs Is going to add a link for the fix here.