bug(backends): unexpected chained join behavior
Chained joins in the pandas backend don't behave as expected and yield an error.
MWE:
import ibis
import pandas as pd
test_df1 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['a', 'a']
})
test_df2 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z', 'z']
})
test_df3 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z1', 'z1']
})
conn = ibis.pandas.connect({'df1': test_df1,
                            'df2': test_df2,
                            'df3': test_df3})
t1 = conn.table('df1')
t2 = conn.table('df2')
t3 = conn.table('df3')
expr1 = t1.join(t2, t1.id == t2.id)
expr2 = t1.join(t2, t1.id == t2.id).join(t3, t1.id == t3.id)
# works
print(expr1.execute())
# fails
print(expr2.execute())
Error:
KeyError: PandasTable(name='df1', schema=ibis.Schema {
id string
value string
}, source=<ibis.backends.pandas.Backend object at 0x7f4edbb87e80>)
Does the pandas backend support chained joins, or is there a different syntax required from what the docs suggest here: https://ibis-project.org/docs/3.1.0/ibis-for-sql-programmers/?h=left_join#multiple-joins ?
@brendanbikes Thanks for the issue.
I'll take a closer look at this today and see if there's a straightforward way to fix the problem.
Really appreciate the MWE!
This also fails for the SQLite backend:
import sqlite3

import ibis
import pandas as pd

sconn = ibis.sqlite.connect('test_db.db')
conn = sqlite3.connect('test_db.db')

test_df1 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['a', 'a']
})
test_df2 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z', 'z']
})
test_df3 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z1', 'z1']
})
test_df1.to_sql('df1', conn, if_exists='replace')
test_df2.to_sql('df2', conn, if_exists='replace')
test_df3.to_sql('df3', conn, if_exists='replace')
sconn.list_tables()
t1 = sconn.table('df1')
t2 = sconn.table('df2')
t3 = sconn.table('df3')
result = t1.join(t2, t1.id==t2.id).join(t3, t1.id==t3.id)
result.execute()
failure:
OperationalError: (sqlite3.OperationalError) no such column: main.df1.id
[SQL: SELECT t0.index_x, t0.id_x, t0.value_x, t0.index_y, t0.id_y, t0.value_y, t1."index", t1.id, t1.value
FROM (SELECT t2."index" AS index_x, t2.id AS id_x, t2.value AS value_x, t3."index" AS index_y, t3.id AS id_y, t3.value AS value_y
FROM main.df1 AS t2 JOIN main.df2 AS t3 ON t2.id = t3.id) AS t0 JOIN main.df3 AS t1 ON main.df1.id = t1.id
LIMIT ? OFFSET ?]
[parameters: (10000, 0)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
I ran into this issue myself here: https://github.com/ibis-project/ibis/discussions/4031
Funnily enough, the duckdb backend yields a different error:
RuntimeError: Binder Error: Referenced table "df1" not found!
Candidate tables: "t1", "t0"
LINE 3: ..._y, t0.value_y, t1.id, t1.value
FROM (SELECT t2.id AS id_x, t2.value AS value_x, t3.id AS id_y, t3.value AS value_y
FROM df1 AS t2 JOIN df2 AS t3 ON t2.id = t3.id) AS t0 JOIN df3 AS t1 ON df1.id = t1.id
Repro block:
import os
import ibis
import pandas as pd
test_df1 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['a', 'a']
})
test_df2 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z', 'z']
})
test_df3 = pd.DataFrame({
    'id': ['1', '1'],
    'value': ['z1', 'z1']
})
if not os.path.exists('test/'):
    os.mkdir('test/')
test_df1.to_parquet('test/df1.parquet')
test_df2.to_parquet('test/df2.parquet')
test_df3.to_parquet('test/df3.parquet')
dconn = ibis.connect('duckdb://:memory:')
dconn.register('test/df1.parquet', 'df1')
dconn.register('test/df2.parquet', 'df2')
dconn.register('test/df3.parquet', 'df3')
t1 = dconn.table('df1')
t2 = dconn.table('df2')
t3 = dconn.table('df3')
result = t1.join(t2, t1.id==t2.id).join(t3, t1.id==t3.id)
result.execute()
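For reference, the compiled SQL can also be inspected without executing; a small sketch assuming a version of ibis with ibis.show_sql (it's used further down in this thread):
# Print the SQL handed to duckdb; the final join condition references
# df1 directly instead of the subquery alias, matching the binder error above.
ibis.show_sql(result, dialect="duckdb")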
I looked into this, and the problem is more challenging than it may appear. We'll need to change the internal representation of join operations to be variadic, so I'm going to punt this to 4.x.
For now, you'll have to create a variable and use that as the table from which to refer to columns in a subsequent join.
I know that's not ideal, but we will fix this eventually. It's important for our UX that this works.
In the meantime you should be able to write
from ibis import _
result = t1.join(t2, t1.id==t2.id).join(t3, _.id==t3.id)
Note the _: it refers to everything to the left of the second join call.
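For completeness, here's a minimal sketch of the intermediate-variable workaround mentioned above, applied to the MWE. It assumes the overlapping columns come back suffixed as id_x/id_y (as in the generated SQL earlier in the thread); check joined.columns for the exact names in your version.
# Name the first join, then refer to its columns in the second join
# instead of reaching back to t1.
joined = t1.join(t2, t1.id == t2.id)
# id from t1 is exposed as id_x on the joined table in this example.
result = joined.join(t3, joined.id_x == t3.id)
result.execute()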
I wonder if this is due to the same root cause, or if this is a Snowflake-specific issue:
purchases = conn.table("purchases", schema="staging")
venues = conn.table("venues", schema="public")
cities = conn.table("cities", schema="public")
r1 = purchases.join(venues, purchases.VENUE_ID == venues.VENUE_ID) # at this point `CITY`-columns in both tables are renamed `CITY_x` and `CITY_y`
r2 = r1.join(cities, purchases.CITY==cities.NAME) # this passes without issue.
r2.execute() # This fails
# Error:
# SQL compilation error: error line 3 at position 128 invalid identifier 'STAGING.PURCHASES.CITY'
The 'STAGING.PURCHASES.CITY' column does exist, but the generated SQL references staging.purchases directly in the outer join condition, where only the subquery alias t0 is in scope.
The generated SQL looks like this:
SELECT
t0."ID",
t0."CITY_x",
t0."COUNTRY_x",
t0."VENUE_ID",
t0."NAME" AS "NAME_x",
t0."CITY_y",
t0."COUNTRY_y",
t1."NAME" AS "NAME_y",
<REDACTED COLUMNS>
FROM
(
SELECT
t2."ID" AS "ID",
t2."CITY" AS "CITY_x",
t2."COUNTRY" AS "COUNTRY_x",
t2."VENUE_ID" AS "VENUE_ID",
t3."NAME" AS "NAME",
t3."CITY" AS "CITY_y",
t3."COUNTRY" AS "COUNTRY_y",
<REDACTED COLUMNS>
FROM
staging.purchases AS t2
JOIN public.venues AS t3 ON t2."VENUE_ID" = t3."VENUE_ID"
) AS t0
JOIN public.cities AS t1 ON staging.purchases."CITY" = t1."NAME" -- this is the problem
Doing r2 = r1.join(cities, r1.CITY_x==cities.NAME).execute() works, but is not very elegant.
For context, what I'm trying to replicate is something like:
select *
from staging.purchases
join venues on public.venues.venue_id = staging.purchases.venue_id
join cities on public.cities.name = staging.purchases.city
Let me know if you think this is a different problem, in which case I'll create a separate issue for it.
@frsann I got my DuckDB and Postgres blocks to work using the underscore API; can you try that real quick and let us know?
FWIW I think it's the same issue, but I'm not 100% sure.
e.g.
# You can alias this if you need the default _
# there's an ongoing discussion about it here: https://github.com/ibis-project/ibis/issues/4704
from ibis import _
purchases = conn.table("purchases", schema="staging")
venues = conn.table("venues", schema="public")
cities = conn.table("cities", schema="public")
r1 = purchases.join(
    venues,
    purchases.VENUE_ID == venues.VENUE_ID,
).join(
    cities,
    _.CITY == cities.NAME,
)
r1.execute()
In this case I need to take the renamed column into account:
purchases.join(venues, purchases.VENUE_ID == venues.VENUE_ID).join(cities, ibis._.CITY_x == cities.NAME).execute()
Interestingly, this appears to be related to overlapping column names in some way:
Not Working
t1 = ibis.table(dict(id="int64"), name="t1")
t2 = ibis.table(dict(id="int64"), name="t2")
t3 = ibis.table(dict(id="int64"), name="t3")
expr = t1.join(t2, t1.id == t2.id).join(t3, t1.id == t3.id)
ibis.show_sql(expr, dialect="duckdb")
SELECT
t0.id AS id_x,
t1.id AS id_y
FROM (
SELECT
t2.id AS id
FROM t1 AS t2
JOIN t2 AS t3
ON t2.id = t3.id
) AS t0
JOIN t3 AS t1
ON t2.id = t1.id
Working
t1 = ibis.table(dict(id1="int64"), name="t1")
t2 = ibis.table(dict(id2="int64"), name="t2")
t3 = ibis.table(dict(id3="int64"), name="t3")
expr = t1.join(t2, t1.id1 == t2.id2).join(t3, t1.id1 == t3.id3)
ibis.show_sql(expr, dialect="duckdb")
SELECT
t0.id1,
t1.id2,
t2.id3
FROM t1 AS t0
JOIN t2 AS t1
ON t0.id1 = t1.id2
JOIN t3 AS t2
ON t0.id1 = t2.id3
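Based on that observation, one stopgap is to rename the key columns so nothing overlaps before chaining. This is a hedged sketch following the working example above (the id2/id3 names and the t2r/t3r variables are just illustrative), not the eventual fix:
import ibis

t1 = ibis.table(dict(id="int64"), name="t1")
t2 = ibis.table(dict(id="int64"), name="t2")
t3 = ibis.table(dict(id="int64"), name="t3")

# Give each join key a distinct name up front so the chained join
# compiles like the non-overlapping case shown above.
t2r = t2.select(t2.id.name("id2"))
t3r = t3.select(t3.id.name("id3"))

expr = t1.join(t2r, t1.id == t2r.id2).join(t3r, t1.id == t3r.id3)
ibis.show_sql(expr, dialect="duckdb")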
This seems very similar to my frustrations in #5514.
Folks, thanks for your patience. I know it's been a while and certainly a long time coming, but we're going to fix this in #7580 along with a host of other issues related to chaining joins.
Bear with us: it'll be a couple of releases, but we'll get it done!
Modulo the overlapping column name deduplication (which we are working on bringing back before releasing this functionality), this is addressed in the-epic-split:
In [4]: import ibis
...: import pandas as pd
...:
...: test_df1 = pd.DataFrame({"id1": ["1", "1"], "value1": ["a", "a"]})
...:
...: test_df2 = pd.DataFrame({"id2": ["1", "1"], "value2": ["z", "z"]})
...:
...: test_df3 = pd.DataFrame({"id3": ["1", "1"], "value3": ["z1", "z1"]})
...:
...: conn = ibis.pandas.connect({"df1": test_df1, "df2": test_df2, "df3": test_df3})
...:
...: t1 = conn.table("df1")
...: t2 = conn.table("df2")
...: t3 = conn.table("df3")
...:
...: expr1 = t1.join(t2, t1.id1 == t2.id2)
...: expr2 = t1.join(t2, t1.id1 == t2.id2).join(t3, t1.id1 == t3.id3)
...:
...: # works
...: print(expr1.execute())
...:
   ...: # previously failed
...: print(expr2.execute())
id1 value1 id2 value2
0 1 a 1 z
1 1 a 1 z
2 1 a 1 z
3 1 a 1 z
id1 value1 id2 value2 id3 value3
0 1 a 1 z 1 z1
1 1 a 1 z 1 z1
2 1 a 1 z 1 z1
3 1 a 1 z 1 z1
4 1 a 1 z 1 z1
5 1 a 1 z 1 z1
6 1 a 1 z 1 z1
7 1 a 1 z 1 z1