ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug: joins use same table reference if two tables are structurally the same

Open NickCrews opened this issue 1 year ago • 6 comments

What happened?

import ibis
from ibis import _

t = ibis.memtable({"x": [1, 2, 3]})
t2 = t.mutate(x=_.x + 1)
t3 = t.mutate(x=_.x + 1)
t4 = t.mutate(x=_.x + 2)
print(id(t2), id(t3), id(t4))
# all three are distinct tables
# even though t2 and t3 are structurally the same
# 11073092688 11095764944 11130247120


j1 = ibis.join(t2, t3, [(t2.x, t3.x)])
j2 = ibis.join(t2, t4, [(t2.x, t4.x)])
ibis.to_sql(j1)
ibis.to_sql(j2)

j1's sql. Note the ON t0.x = t0.x join condition. The rest of the expression could stay the same, as long as that gets fixed to ON t0.x = t0.x. Currently this does a NOOP filter, and then a cross product.

WITH t0 AS (
  SELECT
    t2.x + CAST(1 AS TINYINT) AS x
  FROM ibis_pandas_memtable_wxcdeev74re7vns2b777suwche AS t2
)
SELECT
  t0.x,
  t1.x AS x_right
FROM t0
JOIN t0 AS t1
  ON t0.x = t0.x

j2's sql. Note that two different CTEs are getting generated, so there is no opportunity for mangling to happen.

WITH t1 AS (
  SELECT
    t2.x + CAST(1 AS TINYINT) AS x
  FROM ibis_pandas_memtable_wxcdeev74re7vns2b777suwche AS t2
), t0 AS (
  SELECT
    t2.x + CAST(2 AS TINYINT) AS x
  FROM ibis_pandas_memtable_wxcdeev74re7vns2b777suwche AS t2
)
SELECT
  t1.x
FROM t1
JOIN t0
  ON t1.x = t0.x

What I would expect:

as long as id(t1) != id(t2), then I would expect that ibis.join(t1, t2) would lead to a join with a condition like ON t0.x = t1.x

I can of course get around this by calling .view() on one of them. But my understanding is that this should only be needed if id(t1) == id(t2).

This makes me wonder, can we get around this just by adding in def join()

if id(right) != id(left):
    left = left.view()

Would this ever do something that a user wouldn't want?

What version of ibis are you using?

8.0.0

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

NickCrews avatar Feb 09 '24 02:02 NickCrews

Hey @NickCrews -- the-epic-split includes a complete rewrite and rethinking of our join operations. This is the output on that branch of j1 and j2:

[ins] In [2]: ibis.to_sql(j1)
Out[2]: 
WITH "t1" AS (
  SELECT
    "t0"."x" + CAST(1 AS TINYINT) AS "x"
  FROM "ibis_pandas_memtable_tsatyzmphras3lzufddbppojeu" AS "t0"
)
SELECT
  "t3"."x"
FROM "t1" AS "t3"
INNER JOIN "t1" AS "t4"
  ON "t3"."x" = "t4"."x"

[ins] In [3]: ibis.to_sql(j2)
Out[3]: 
SELECT
  "t3"."x"
FROM (
  SELECT
    "t0"."x" + CAST(1 AS TINYINT) AS "x"
  FROM "ibis_pandas_memtable_tsatyzmphras3lzufddbppojeu" AS "t0"
) AS "t3"
INNER JOIN (
  SELECT
    "t0"."x" + CAST(2 AS TINYINT) AS "x"
  FROM "ibis_pandas_memtable_tsatyzmphras3lzufddbppojeu" AS "t0"
) AS "t4"
  ON "t3"."x" = "t4"."x"

gforsyth avatar Feb 09 '24 13:02 gforsyth

Nice! Ok then sounds like this is already fixed. Was there a test added for this in that branch?

NickCrews avatar Feb 09 '24 18:02 NickCrews

I'm not sure whether there's a test for this specific case -- might have to spelunk a bit.

gforsyth avatar Feb 09 '24 18:02 gforsyth

we did add a LOT of tests for different kinds of join chaining, so probably, but I can't confirm that with full confidence.

gforsyth avatar Feb 09 '24 18:02 gforsyth

We have an extensive set of tests for this behavior and many other tricky join issues in test_newrels.py.

cpcloud avatar Feb 12 '24 19:02 cpcloud

@kszucs Is going to add a link for the fix here.

cpcloud avatar Feb 12 '24 19:02 cpcloud