ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug: `t1.count() + t2.count()` only uses t1

Open NickCrews opened this issue 2 years ago • 10 comments

What happened?

import ibis

ibis.options.interactive = True

good_islands = ibis.memtable({"island": ["Torgersen", "Biscoe"]})
t = ibis.examples.penguins.fetch()
t.island.value_counts()
# ┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
# ┃ island    ┃ island_count ┃
# ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
# │ string    │ int64        │
# ├───────────┼──────────────┤
# │ Dream     │          124 │
# │ Torgersen │           52 │
# │ Biscoe    │          168 │
# └───────────┴──────────────┘
t.count()
# 344


good = t.join(good_islands, "island", how="left_semi")
bad = t.join(good_islands, "island", how="anti")

good.count()
# 220 as expected

bad.count()
# 124 as expected

union = good.count() + bad.count()
union
# 440, not 344 as expected

ibis.show_sql(union)
# SELECT
#   COUNT(*) + COUNT(*) AS "Add(CountStar(), CountStar())"
# FROM (
#   SELECT
#     t0.species AS species,
#     t0.island AS island,
#     t0.bill_length_mm AS bill_length_mm,
#     t0.bill_depth_mm AS bill_depth_mm,
#     t0.flipper_length_mm AS flipper_length_mm,
#     t0.body_mass_g AS body_mass_g,
#     t0.sex AS sex,
#     t0.year AS year
#   FROM main.penguins AS t0
#   WHERE
#     EXISTS(
#       SELECT
#         1
#       FROM _ibis_pandas_memtable_jg47nwdavjft7diic45wxhxpxa AS t1
#       WHERE
#         t0.island = t1.island
#     )
# ) AS anon_1

If I use two unrelated tables:

t1 = ibis.examples.Cushings.fetch()
t2 = ibis.examples.penguins.fetch()
t1.count() + t2.count()

then I get RelationError: Selection expressions don't fully originate from dependencies of the table expression. instead. I would expect this to work.

What version of ibis are you using?

7.1.0

What backend(s) are you using, if any?

NA

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

NickCrews avatar Dec 04 '23 20:12 NickCrews

What would be the equivalent SQL for t1.count() + t2.count()?

cpcloud avatar Dec 06 '23 21:12 cpcloud

This does actually seem to work in shell.duckdb.org to add scalar to scalar:

SELECT
   (SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet') +
   (SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet');

or to add a scalar with a vector:

SELECT
  (SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet') + c_custkey
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';

But perhaps this doesn't work in all backends?

NickCrews avatar Dec 06 '23 23:12 NickCrews

This is now achievable with our new-in-9.0 as_scalar() method, used here to explicitly construct a subquery:

In [28]: from ibis.interactive import *
    ...:
    ...: t1 = ibis.examples.Cushings.fetch()
    ...: t2 = ibis.examples.penguins.fetch()
    ...: expr = t1.count().as_scalar() + t2.count().as_scalar()

In [29]: expr
Out[29]:
┌───────────────┐
│ np.int64(371) │
└───────────────┘

In [30]: ibis.to_sql(expr, dialect="duckdb")
Out[30]:
SELECT
  (
    SELECT
      COUNT(*) AS "CountStar(Cushings)"
    FROM "Cushings" AS "t0"
  ) + (
    SELECT
      COUNT(*) AS "CountStar(penguins)"
    FROM "penguins" AS "t1"
  ) AS "Add(ScalarSubquery(), ScalarSubquery())"

The naked t1.count() + t2.count() does have a bizarre failure mode, so I'm looking into that, but this issue can be closed.

cpcloud avatar Jun 27 '24 17:06 cpcloud

@cpcloud isn't .count() always a scalar? I definitely assumed it was, I think (no evidence) that others will think the same thing, especially if coming from pandas/python world, where semantically thye think t.count() ~= len(t). So could we make it so that at the end of def count(self): we finish with a .as_scalar(), so both of my examples just work out of the box?

NickCrews avatar Jun 27 '24 19:06 NickCrews

count() is still tied to a table, and Ibis can't easily determine the intent with such an expression (using it combination with other essentially random unrelated expressions).

cpcloud avatar Jun 27 '24 19:06 cpcloud

This issue also isn't super well motivated.

Is there some concrete use case here?

cpcloud avatar Jun 27 '24 19:06 cpcloud

Also your examples using sql are exactly what calling as_scalar() will produce.

cpcloud avatar Jun 27 '24 19:06 cpcloud

I think this is best we can do for now without a more strongly motivated use case or reasons why combining two scalar expressions in arbitrary ways must "just work".

It might help to list out the ways in which scalars can currently be used and then define semantics for those operations.

cpcloud avatar Jun 27 '24 19:06 cpcloud

Not at a computer right now to test, but one example of a scalar from one table needing to interact with another table is a "manual autoincrement": ibis.union(t_new.mutate(id=ibis.row_number() + t_old.id.max()), t_old)

NickCrews avatar Jun 28 '24 06:06 NickCrews

In SQL, is my assumption correct that you can combine two scalar expressions in arbitrary ways?

NickCrews avatar Jul 06 '24 16:07 NickCrews

Closing since this now works when doing.

In [8]: union = good.count().as_scalar() + bad.count().as_scalar()

In [9]: union
Out[9]: 
┌─────┐
│ 344 │
└─────┘

xref: https://github.com/ibis-project/ibis/issues/10124#issuecomment-2352980478

ncclementi avatar Sep 19 '24 22:09 ncclementi