duckdb_iceberg icon indicating copy to clipboard operation
duckdb_iceberg copied to clipboard

Unable to do JOIN between to iceberg based tables

Open sarialalem1 opened this issue 1 year ago • 6 comments

Example of what I'm trying to do:

select * from iceberg_scan('s3://<url1>.metadata.json') as t1 join iceberg_scan('s3://t2.metadata.json') as t2 on 1=1

When running the above I get this error: Binder Error: Duplicate alias \"iceberg_scan_data\" in query!

Although I'm passing an alias for each iceberg_scan operation, it's still assigning the same iceberg_scan_data alias to both.

A suggestion would be to add the alias ass a parameter to the function.

sarialalem1 avatar Mar 11 '24 10:03 sarialalem1

You can use WITH clause instead of alias to define both t1 and t2.

harel-e avatar Mar 13 '24 05:03 harel-e

Yea this is a side-effect of the way things are implemented right now. This will likely change in the future, but I need to get some refactors in DuckDB-side before I can fix that

samansmink avatar Mar 27 '24 09:03 samansmink

@samansmink maybe this would be a good solution? https://github.com/duckdb/duckdb_iceberg/issues/57

mike-luabase avatar Jun 20 '24 12:06 mike-luabase

In addition to @harel-e suggestion of WITH, you can also create views, which work fine:

sql = '''
CREATE SCHEMA IF NOT EXISTS tpc;

CREATE VIEW tpc.customer AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/customer', allow_moved_paths = true);

CREATE VIEW tpc.lineitem AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/lineitem', allow_moved_paths = true);

CREATE VIEW tpc.nation AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/nation', allow_moved_paths = true);

CREATE VIEW tpc.orders AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/orders', allow_moved_paths = true);

CREATE VIEW tpc.part AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/part', allow_moved_paths = true);

CREATE VIEW tpc.partsupp AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/partsupp', allow_moved_paths = true);

CREATE VIEW tpc.region AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/region', allow_moved_paths = true);

CREATE VIEW tpc.supplier AS
SELECT * FROM iceberg_scan('/Users/mritchie712/blackbird/notebooks/iceduck_tpc/default.db/supplier', allow_moved_paths = true);
'''

# Execute the SQL
con.execute(sql)
sql = f"""
select * 
from 
tpc.customer as c inner join 
tpc.orders as o on c.c_custkey = o.o_custkey
limit 20;
"""
res = con.execute(sql).fetchdf()
res

mike-luabase avatar Jun 20 '24 18:06 mike-luabase

Another way to work around this is by wrapping the iceberg scans in subqueries. So the query in the original issue would look like this:

select * from (from iceberg_scan('s3://<url1>.metadata.json')) as t1 join (from iceberg_scan('s3://t2.metadata.json')) as t2 on 1=1

JelteF avatar Jan 15 '25 15:01 JelteF

FYI this will likely be fixed by DuckDB 1.2.0 due to https://github.com/duckdb/duckdb/pull/15843 being merged

JelteF avatar Jan 24 '25 09:01 JelteF