bug: IntegrityError when chaining calls to `outer_join`
What happened?
When executing the following function chain (pseudocode):
base_aggrs = [
    ibis.Table(py_table).group_by().aggregate()
    for py_table in (py_table0, ..., py_table3)
]
query = (
    base_aggrs[0].outer_join(base_aggrs[1], ...).select(...)
    .outer_join(base_aggrs[2], ...).select(...)
)
The error I get is:
*** ibis.common.exceptions.IntegrityError: Cannot add <ibis.expr.operations.logical.Equals object at 0x10dd21650> to projection, they belong to another relation
The first outer_join results in an ibis.Table (<class 'ibis.expr.types.relations.Table'>), and I would expect every further step in the chain to produce an ibis.Table as well.
Note that I am actually doing this in a loop as seen here (though the code I'm executing has been updated to use just ibis.table instead of a pandas connection).
What version of ibis are you using?
python 3.12.7 (Python 3.12.7 (main, Oct 1 2024, 02:05:46) [Clang 16.0.0 (clang-1600.0.26.3)] on darwin)
ibis-framework[duckdb]==9.5.0
>>> import ibis
>>> ibis.__version__
'9.5.0'
What backend(s) are you using, if any?
DuckDB
Relevant log output (pdb excerpts)
I am able to successfully run one iteration of outer_join as follows:
convenience function:
def AggregateJoin(left_table, right_table):
    return (
        left_table.outer_join(right_table, left_table.gene_id == right_table.gene_id)
        .select(
            ibis.coalesce(left_table.gene_id, right_table.gene_id).name('gene_id'),
            (left_table.cell_count + right_table.cell_count).name('cell_count'),
            (left_table.expr_total + right_table.expr_total).name('expr_total'),
        )
    )
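For reference, here is a minimal plain-Python sketch of the result the helper above is meant to produce when it is chained over several aggregates. It does not use ibis: dictionaries stand in for the aggregated tables, `None` stands in for SQL NULL, and `aggregate_join`, `add_nullable`, and the gene data are hypothetical names and values made up for illustration.

```python
from functools import reduce


def add_nullable(a, b):
    """Add two values the way SQL does: NULL (None) on either side gives NULL."""
    return None if a is None or b is None else a + b


def aggregate_join(left: dict, right: dict) -> dict:
    """Model of AggregateJoin: full outer join on gene_id, then add the metrics.

    Each argument maps gene_id -> (cell_count, expr_total).
    """
    missing = (None, None)  # what an outer join yields for an unmatched side
    joined = {}
    # Iterating over the union of keys plays the role of
    # coalesce(left.gene_id, right.gene_id).
    for gene_id in sorted(left.keys() | right.keys()):
        l_count, l_total = left.get(gene_id, missing)
        r_count, r_total = right.get(gene_id, missing)
        joined[gene_id] = (
            add_nullable(l_count, r_count),
            add_nullable(l_total, r_total),
        )
    return joined


# Hypothetical per-table aggregates (not the data from this report).
base_aggrs = [
    {"geneA": (2, 10.0), "geneB": (1, 3.0)},
    {"geneA": (1, 5.0), "geneC": (4, 8.0)},
    {"geneA": (3, 1.5), "geneB": (2, 2.0)},
]

# Chain the join over every aggregate, as the loop in the report does.
result = reduce(aggregate_join, base_aggrs)
print(result)
```

In this model only `geneA` appears in all three inputs, so it is the only key with non-null metrics. Genes missing from any input come out as `None`, mirroring NULL propagation through `+`. If unmatched genes should keep their values, wrapping each metric in a coalesce with 0 would be one option, though that is an assumption about intent.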
Debugging (left_table):
>>> left_table
r0 := InMemoryTable
data:
PyArrowTableProxy:
pyarrow.Table
gene_id: string
cell_id: string
expression: float
----
gene_id: [["ENSG00000004455","ENSG00000004059","ENSG00000003756","ENSG00000003436","ENSG00000003402"]]
cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
expression: [[1,397,1,6,67.35789]]
Aggregate[r0]
groups:
gene_id: r0.gene_id
metrics:
cell_count: CountStar(r0)
expr_total: Sum(r0.expression)
Debugging (right_table):
>>> right_table
r0 := InMemoryTable
data:
PyArrowTableProxy:
pyarrow.Table
gene_id: string
cell_id: string
expression: float
----
gene_id: [["ENSG00000003147","ENSG00000002746","ENSG00000002586","ENSG00000001460","ENSG00000000457"]]
cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
expression: [[489,3,1755,1,4.282037]]
Aggregate[r0]
groups:
gene_id: r0.gene_id
metrics:
cell_count: CountStar(r0)
expr_total: Sum(r0.expression)
Debugging (result):
>>> query = AggregateJoin(t1, t2)
>>> query
r0 := InMemoryTable
data:
PyArrowTableProxy:
pyarrow.Table
gene_id: string
cell_id: string
expression: float
----
gene_id: [["ENSG00000004455","ENSG00000004059","ENSG00000003756","ENSG00000003436","ENSG00000003402"]]
cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
expression: [[1,397,1,6,67.35789]]
r1 := InMemoryTable
data:
PyArrowTableProxy:
pyarrow.Table
gene_id: string
cell_id: string
expression: float
----
gene_id: [["ENSG00000003147","ENSG00000002746","ENSG00000002586","ENSG00000001460","ENSG00000000457"]]
cell_id: [["SRR5766151","SRR5766151","SRR5766151","SRR5766151","SRR5766151"]]
expression: [[489,3,1755,1,4.282037]]
r2 := Aggregate[r0]
groups:
gene_id: r0.gene_id
metrics:
cell_count: CountStar(r0)
expr_total: Sum(r0.expression)
r3 := Aggregate[r1]
groups:
gene_id: r1.gene_id
metrics:
cell_count: CountStar(r1)
expr_total: Sum(r1.expression)
JoinChain[r2]
JoinLink[outer, r3]
r2.gene_id == r3.gene_id
values:
gene_id: Coalesce([r2.gene_id, r3.gene_id])
cell_count: r2.cell_count + r3.cell_count
expr_total: r2.expr_total + r3.expr_total
Then, a second iteration throws the error, as follows.
Debugging (t3):
>>> t3
r0 := InMemoryTable
data:
PyArrowTableProxy:
pyarrow.Table
gene_id: string
cell_id: string
expression: float
----
gene_id: [["ENSG00000285991","ENSG00000285920","ENSG00000285733","ENSG00000285721","ENSG00000285629"]]
cell_id: [["SRR5765852","SRR5765852","SRR5765852","SRR5765852","SRR5765852"]]
expression: [[1,2,1.0522599,1.3091263,1.347101]]
Aggregate[r0]
groups:
gene_id: r0.gene_id
metrics:
cell_count: CountStar(r0)
expr_total: Sum(r0.expression)
The actual error:
>>> AggregateJoin(query, t3)
*** ibis.common.exceptions.IntegrityError: Cannot add <ibis.expr.operations.logical.Equals object at 0x109f00bd0> to projection, they belong to another relation
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I'm not sure if this is actually a bug or if something I used to do is no longer valid.
Note that this code was working prior to the large ibis rewrite. Since then I have updated to the latest version of ibis and dropped the use of ibis_conn = ibis.pandas.connect({}). Now, instead of getting a pyarrow table via ibis_conn.table(), I'm using ibis.memtable(<pyarrow.Table>, name='some_name').
If any other context is needed on this, please let me know!
Can you please make the reproducer copypastable?
sure, I can do that by end of day tomorrow.
turns out by "tomorrow" I meant "in 2 days".
here's sample code that reproduces it for me: https://gist.github.com/drin/ecbf5ed90de749e420eafbdd70a76750
lines 45-52 represent an unrolled loop where the error is thrown, and lines 19-36 are where the query itself is built; QueryAverageByGeneID just defines a group-by query on a test table, and QueryJoinAverages accumulates the queries linearly using outer_join.
Also, this occurs with ibis==9.5.0
I am trying to do essentially the same thing and getting the same error with ibis==10.6.0.
Is this confirmed to be a bug? Did you find a way to outer join on the same keys and coalesce multiple tables?