bug(backends): unexpected chained join behavior
Chained joins in the pandas backend don't behave as expected and yield an error.
MWE:
import ibis
import pandas as pd
test_df1 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['a', 'a']
})
test_df2 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z', 'z']
})
test_df3 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z1', 'z1']
})
conn = ibis.pandas.connect({'df1': test_df1,
                            'df2': test_df2,
                            'df3': test_df3})
t1 = conn.table('df1')
t2 = conn.table('df2')
t3 = conn.table('df3')
expr1 = t1.join(t2, t1.id == t2.id)
expr2 = t1.join(t2, t1.id == t2.id).join(t3, t1.id == t3.id)
# works
print(expr1.execute())
# fails
print(expr2.execute())
Error:
KeyError: PandasTable(name='df1', schema=ibis.Schema {
id string
value string
}, source=<ibis.backends.pandas.Backend object at 0x7f4edbb87e80>)
Does the pandas backend support chained joins, or is there a different syntax required from what the docs suggest here: https://ibis-project.org/docs/3.1.0/ibis-for-sql-programmers/?h=left_join#multiple-joins ?
@brendanbikes Thanks for the issue.
I'll take a closer look at this today and see if there's a straightforward way to fix the problem.
Really appreciate the MWE!
This also fails for the SQLite backend:
import sqlite3

import ibis
import pandas as pd

sconn = ibis.sqlite.connect('test_db.db')
conn = sqlite3.connect('test_db.db')

test_df1 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['a', 'a']
})
test_df2 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z', 'z']
})
test_df3 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z1', 'z1']
})
test_df1.to_sql('df1', conn, if_exists='replace')
test_df2.to_sql('df2', conn, if_exists='replace')
test_df3.to_sql('df3', conn, if_exists='replace')
sconn.list_tables()
t1 = sconn.table('df1')
t2 = sconn.table('df2')
t3 = sconn.table('df3')
result = t1.join(t2, t1.id==t2.id).join(t3, t1.id==t3.id)
result.execute()
failure:
OperationalError: (sqlite3.OperationalError) no such column: main.df1.id
[SQL: SELECT t0.index_x, t0.id_x, t0.value_x, t0.index_y, t0.id_y, t0.value_y, t1."index", t1.id, t1.value
FROM (SELECT t2."index" AS index_x, t2.id AS id_x, t2.value AS value_x, t3."index" AS index_y, t3.id AS id_y, t3.value AS value_y
FROM main.df1 AS t2 JOIN main.df2 AS t3 ON t2.id = t3.id) AS t0 JOIN main.df3 AS t1 ON main.df1.id = t1.id
LIMIT ? OFFSET ?]
[parameters: (10000, 0)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
I ran into this issue myself here: https://github.com/ibis-project/ibis/discussions/4031
Funnily enough, the duckdb backend yields a different error:
RuntimeError: Binder Error: Referenced table "df1" not found!
Candidate tables: "t1", "t0"
LINE 3: ..._y, t0.value_y, t1.id, t1.value
FROM (SELECT t2.id AS id_x, t2.value AS value_x, t3.id AS id_y, t3.value AS value_y
FROM df1 AS t2 JOIN df2 AS t3 ON t2.id = t3.id) AS t0 JOIN df3 AS t1 ON df1.id = t1.id
Repro block:
import os
import ibis
import pandas as pd
test_df1 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['a', 'a']
})
test_df2 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z', 'z']
})
test_df3 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z1', 'z1']
})
if not os.path.exists('test/'):
    os.mkdir('test/')
test_df1.to_parquet('test/df1.parquet')
test_df2.to_parquet('test/df2.parquet')
test_df3.to_parquet('test/df3.parquet')
dconn = ibis.connect('duckdb://:memory:')
dconn.register('test/df1.parquet', 'df1')
dconn.register('test/df2.parquet', 'df2')
dconn.register('test/df3.parquet', 'df3')
t1 = dconn.table('df1')
t2 = dconn.table('df2')
t3 = dconn.table('df3')
result = t1.join(t2, t1.id==t2.id).join(t3, t1.id==t3.id)
result.execute()
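For reference, the compiled SQL can also be inspected without executing; a small sketch assuming a version of ibis with ibis.show_sql (it's used further down in this thread):
# Print the SQL handed to duckdb; the final join condition references
# df1 directly instead of the subquery alias, matching the binder error above.
ibis.show_sql(result, dialect="duckdb")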
I looked into this, and the problem is more challenging than it may appear. We'll need to change the internal representation of join operations to be variadic, so I'm going to punt this to 4.x.
For now, you'll have to create a variable and use that as the table from which to refer to columns in a subsequent join.
I know that's not ideal, but we will fix this eventually. It's important for our UX that this works.
In the meantime you should be able to write
from ibis import _
result = t1.join(t2, t1.id==t2.id).join(t3, _.id==t3.id)
Note the _: it refers to everything to the left of the second join call.
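For completeness, here's a minimal sketch of the intermediate-variable workaround mentioned above, applied to the MWE. It assumes the overlapping columns come back suffixed as id_x/id_y (as in the generated SQL earlier in the thread); check joined.columns for the exact names in your version.
# Name the first join, then refer to its columns in the second join
# instead of reaching back to t1.
joined = t1.join(t2, t1.id == t2.id)
# id from t1 is exposed as id_x on the joined table in this example.
result = joined.join(t3, joined.id_x == t3.id)
result.execute()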
I wonder if this is due to the same root cause, or if this is a Snowflake-specific issue:
purchases = conn.table("purchases", schema="staging")
venues = conn.table("venues", schema="public")
cities = conn.table("cities", schema="public")
r1 = purchases.join(venues, purchases.VENUE_ID == venues.VENUE_ID) # at this point `CITY`-columns in both tables are renamed `CITY_x` and `CITY_y`
r2 = r1.join(cities, purchases.CITY==cities.NAME) # this passes without issue.
r2.execute() # This fails
# Error:
# SQL compilation error: error line 3 at position 128 invalid identifier 'STAGING.PURCHASES.CITY'
The 'STAGING.PURCHASES.CITY' column does exist, but the generated SQL references staging.purchases directly in the outer join condition, where only the subquery alias t0 is in scope.
The generated SQL looks like this:
SELECT
t0."ID",
t0."CITY_x",
t0."COUNTRY_x",
t0."VENUE_ID",
t0."NAME" AS "NAME_x",
t0."CITY_y",
t0."COUNTRY_y",
t1."NAME" AS "NAME_y",
<REDACTED COLUMNS>
FROM
(
SELECT
t2."ID" AS "ID",
t2."CITY" AS "CITY_x",
t2."COUNTRY" AS "COUNTRY_x",
t2."VENUE_ID" AS "VENUE_ID",
t3."NAME" AS "NAME",
t3."CITY" AS "CITY_y",
t3."COUNTRY" AS "COUNTRY_y",
<REDACTED COLUMNS>
FROM
staging.purchases AS t2
JOIN public.venues AS t3 ON t2."VENUE_ID" = t3."VENUE_ID"
) AS t0
JOIN public.cities AS t1 ON staging.purchases."CITY" = t1."NAME" -- this is the problem
Doing r2 = r1.join(cities, r1.CITY_x==cities.NAME).execute() works, but is not very elegant.
For context, what I'm trying to replicate is something like:
select *
from staging.purchases
join venues on public.venues.venue_id = staging.purchases.venue_id
join cities on public.cities.name = staging.purchases.city
Let me know if you think this is a different problem, in which case I'll create a separate issue for it.
@frsann I got my DuckDB and Postgres blocks to work using the underscore API; can you try that real quick and let us know?
FWIW I think it's the same issue, but I'm not 100% sure.
e.g.
# You can alias this if you need the default _
# there's an ongoing discussion about it here: https://github.com/ibis-project/ibis/issues/4704
from ibis import _
purchases = conn.table("purchases", schema="staging")
venues = conn.table("venues", schema="public")
cities = conn.table("cities", schema="public")
r1 = purchases.join(
    venues,
    purchases.VENUE_ID == venues.VENUE_ID,
).join(
    cities,
    _.CITY == cities.NAME,
)
r1.execute()
In this case I need to take the renamed column into account:
purchases.join(venues, purchases.VENUE_ID == venues.VENUE_ID).join(cities, ibis._.CITY_x == cities.NAME).execute()
Interestingly, this appears to be related to overlapping column names in some way:
Not Working
t1 = ibis.table(dict(id="int64"), name="t1")
t2 = ibis.table(dict(id="int64"), name="t2")
t3 = ibis.table(dict(id="int64"), name="t3")
expr = t1.join(t2, t1.id == t2.id).join(t3, t1.id == t3.id)
ibis.show_sql(expr, dialect="duckdb")
SELECT
t0.id AS id_x,
t1.id AS id_y
FROM (
SELECT
t2.id AS id
FROM t1 AS t2
JOIN t2 AS t3
ON t2.id = t3.id
) AS t0
JOIN t3 AS t1
ON t2.id = t1.id
Working
t1 = ibis.table(dict(id1="int64"), name="t1")
t2 = ibis.table(dict(id2="int64"), name="t2")
t3 = ibis.table(dict(id3="int64"), name="t3")
expr = t1.join(t2, t1.id1 == t2.id2).join(t3, t1.id1 == t3.id3)
ibis.show_sql(expr, dialect="duckdb")
SELECT
t0.id1,
t1.id2,
t2.id3
FROM t1 AS t0
JOIN t2 AS t1
ON t0.id1 = t1.id2
JOIN t3 AS t2
ON t0.id1 = t2.id3
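Based on that observation, one stopgap is to rename the key columns so nothing overlaps before chaining. This is a hedged sketch following the working example above (the id2/id3 names and the t2r/t3r variables are just illustrative), not the eventual fix:
import ibis

t1 = ibis.table(dict(id="int64"), name="t1")
t2 = ibis.table(dict(id="int64"), name="t2")
t3 = ibis.table(dict(id="int64"), name="t3")

# Give each join key a distinct name up front so the chained join
# compiles like the non-overlapping case shown above.
t2r = t2.select(t2.id.name("id2"))
t3r = t3.select(t3.id.name("id3"))

expr = t1.join(t2r, t1.id == t2r.id2).join(t3r, t1.id == t3r.id3)
ibis.show_sql(expr, dialect="duckdb")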
This seems very similar to my frustrations in #5514.
Folks, thanks for your patience. I know it's been a while and certainly a long time coming, but we're going to fix this in #7580 along with a host of other issues related to chaining joins.
Bear with us: it'll be a couple of releases, but we'll get it done!
Modulo the overlapping column name deduplication (which we are working on bringing back before releasing this functionality), this is addressed in the-epic-split:
In [4]: import ibis
...: import pandas as pd
...:
...: test_df1 = pd.DataFrame({"id1": ["1", "1"], "value1": ["a", "a"]})
...:
...: test_df2 = pd.DataFrame({"id2": ["1", "1"], "value2": ["z", "z"]})
...:
...: test_df3 = pd.DataFrame({"id3": ["1", "1"], "value3": ["z1", "z1"]})
...:
...: conn = ibis.pandas.connect({"df1": test_df1, "df2": test_df2, "df3": test_df3})
...:
...: t1 = conn.table("df1")
...: t2 = conn.table("df2")
...: t3 = conn.table("df3")
...:
...: expr1 = t1.join(t2, t1.id1 == t2.id2)
...: expr2 = t1.join(t2, t1.id1 == t2.id2).join(t3, t1.id1 == t3.id3)
...:
...: # works
...: print(expr1.execute())
...:
   ...: # previously failed
...: print(expr2.execute())
id1 value1 id2 value2
0 1 a 1 z
1 1 a 1 z
2 1 a 1 z
3 1 a 1 z
id1 value1 id2 value2 id3 value3
0 1 a 1 z 1 z1
1 1 a 1 z 1 z1
2 1 a 1 z 1 z1
3 1 a 1 z 1 z1
4 1 a 1 z 1 z1
5 1 a 1 z 1 z1
6 1 a 1 z 1 z1
7 1 a 1 z 1 z1