bug: `t1.count() + t2.count()` only uses t1
What happened?
import ibis
ibis.options.interactive = True
good_islands = ibis.memtable({"island": ["Torgersen", "Biscoe"]})
t = ibis.examples.penguins.fetch()
t.island.value_counts()
# ┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
# ┃ island ┃ island_count ┃
# ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
# │ string │ int64 │
# ├───────────┼──────────────┤
# │ Dream │ 124 │
# │ Torgersen │ 52 │
# │ Biscoe │ 168 │
# └───────────┴──────────────┘
t.count()
# 344
good = t.join(good_islands, "island", how="left_semi")
bad = t.join(good_islands, "island", how="anti")
good.count()
# 220 as expected
bad.count()
# 124 as expected
union = good.count() + bad.count()
union
# 440, not 344 as expected
ibis.show_sql(union)
# SELECT
# COUNT(*) + COUNT(*) AS "Add(CountStar(), CountStar())"
# FROM (
# SELECT
# t0.species AS species,
# t0.island AS island,
# t0.bill_length_mm AS bill_length_mm,
# t0.bill_depth_mm AS bill_depth_mm,
# t0.flipper_length_mm AS flipper_length_mm,
# t0.body_mass_g AS body_mass_g,
# t0.sex AS sex,
# t0.year AS year
# FROM main.penguins AS t0
# WHERE
# EXISTS(
# SELECT
# 1
# FROM _ibis_pandas_memtable_jg47nwdavjft7diic45wxhxpxa AS t1
# WHERE
# t0.island = t1.island
# )
# ) AS anon_1
If I use two unrelated tables:
t1 = ibis.examples.Cushings.fetch()
t2 = ibis.examples.penguins.fetch()
t1.count() + t2.count()
then I get RelationError: Selection expressions don't fully originate from dependencies of the table expression. instead. I would expect this to work.
What version of ibis are you using?
7.1.0
What backend(s) are you using, if any?
NA
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
What would be the equivalent SQL for t1.count() + t2.count()?
This does actually seem to work in shell.duckdb.org to add scalar to scalar:
SELECT
(SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet') +
(SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet');
or to add a scalar with a vector:
SELECT
(SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet') + c_custkey
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';
But perhaps this doesn't work in all backends?
This is now achievable with our new-in-9.0 as_scalar() method, used here to explicitly construct a subquery:
In [28]: from ibis.interactive import *
...:
...: t1 = ibis.examples.Cushings.fetch()
...: t2 = ibis.examples.penguins.fetch()
...: expr = t1.count().as_scalar() + t2.count().as_scalar()
In [29]: expr
Out[29]:
┌───────────────┐
│ np.int64(371) │
└───────────────┘
In [30]: ibis.to_sql(expr, dialect="duckdb")
Out[30]:
SELECT
(
SELECT
COUNT(*) AS "CountStar(Cushings)"
FROM "Cushings" AS "t0"
) + (
SELECT
COUNT(*) AS "CountStar(penguins)"
FROM "penguins" AS "t1"
) AS "Add(ScalarSubquery(), ScalarSubquery())"
The naked t1.count() + t2.count() does have a bizarre failure mode, so I'm looking into that, but this issue can be closed.
@cpcloud isn't .count() always a scalar? I definitely assumed it was, I think (no evidence) that others will think the same thing, especially if coming from pandas/python world, where semantically thye think t.count() ~= len(t). So could we make it so that at the end of def count(self): we finish with a .as_scalar(), so both of my examples just work out of the box?
count() is still tied to a table, and Ibis can't easily determine the intent with such an expression (using it combination with other essentially random unrelated expressions).
This issue also isn't super well motivated.
Is there some concrete use case here?
Also your examples using sql are exactly what calling as_scalar() will produce.
I think this is best we can do for now without a more strongly motivated use case or reasons why combining two scalar expressions in arbitrary ways must "just work".
It might help to list out the ways in which scalars can currently be used and then define semantics for those operations.
Not at a computer right now to test, but one example of a scalar from one table needing to interact with another table is a "manual autoincrement": ibis.union(t_new.mutate(id=ibis.row_number() + t_old.id.max()), t_old)
In SQL, is my assumption correct that you can combine two scalar expressions in arbitrary ways?
Closing since this now works when doing.
In [8]: union = good.count().as_scalar() + bad.count().as_scalar()
In [9]: union
Out[9]:
┌─────┐
│ 344 │
└─────┘
xref: https://github.com/ibis-project/ibis/issues/10124#issuecomment-2352980478