ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: better api for creating aggregates over scalars

Open NickCrews opened this issue 10 months ago • 2 comments

Is your feature request related to a problem?

In SQL, you can create aggregates over scalars. However, in ibis, the ir.Scalar Values don't have any of the aggregation functions:

import ibis

ibis.options.interactive = True

# aggregating over constants works
sql = """
WITH t as (
    SELECT 1 AS a, 5 AS b UNION ALL
    SELECT 1 AS a, 7 AS b UNION ALL
    SELECT 2 AS a, 3 AS b
)
SELECT AVG(b) AS avg_b, AVG(NULL) as avg_null, AVG(43) AS avg_const
FROM t
GROUP BY a
"""
from_sql = ibis.duckdb.connect().sql(sql)
print(from_sql.execute())
# works

# You don't even need to perform the aggregation over a table/group/etc
sql = """
SELECT AVG(NULL) as avg_null, AVG(43) AS avg_const
"""
from_sql = ibis.duckdb.connect().sql(sql)
print(from_sql.execute())
# works


t = ibis.memtable({"a": [1, 1, 2], "b": [5, 7, 3]})
t.group_by("a").aggregate(
    avg_b=t.b.mean(),
    avg_null=ibis.null(float).mean(),  # AttributeError: 'FloatingScalar' object has no attribute 'mean'
    avg_const=ibis.literal(43).mean(),  # AttributeError: 'IntegerScalar' object has no attribute 'mean'
)

What is the motivation behind your request?

I have some dataclass like

@dataclass
class Address:
    street1: ir.StringValue
    street2: ir.StringValue
    city: ir.StringValue
    ...

which I use as an abstraction layer to smooth over the differences in schemas that my vaiours tables have. Then I have some factory method, where the street2 field is optional. If not given, it infers NULL:

def make_address(t: ibis.Table):
    if "street2" is not in t.columns
        street2 = ibis.null(str)
    else:
        street2 = t.street2
    ....
    return Address(...)

Then I want to be able to use these as

def canonicalize_addresses(t: ibis.Table):
    address = make_address(t)
    quality = rank_quality(address)
    aggs = {
         street1: address.street1.first(order_by=quality),
         street2: address.street2.first(order_by=quality),
         ....,
     },
     return t.group_by("address_id").agg(**aggs)

In other words, once I get the Address dataclass back, I want to be able to treat these as values from the relation. This abstraction works great except for when I have to perform aggregations like in the above.

Describe the solution you'd like

Several options

  1. Don't change anything. I will need to add in an extra .select() somehwere in all using code so that the ibis.null() is turned from a ir.Scalar to to a ir.Column.
  2. Add a as_column() method to Scalars, Columns, and Tables, similar to the existing .as_scalar() and .as_table() methods. Then I will have to explicitly call this function.
  3. Add aggregation methods to ir.Scalar objects. IDK if we should make it so that ibis.literal(42).mean().execute() gives an error (similar to current behavior) or perhaps we should do what SQL does, and have this result in 42.

What version of ibis are you running?

main

What backend(s) are you using, if any?

duckdb, but I think this should affect all SQL backends the same. I think duckdbs handling of SELECT AVG(42) should be the same in all backends?

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

NickCrews avatar Jun 02 '25 18:06 NickCrews

Hm, yeah.

One weird thing here is that adding aggregate methods to scalars sort of obviates the need for scalars and columns, since they now just become special cases of tables. I'm not saying we should do that, but it does feel weird to add aggregate methods to scalars and then also keep the distinction of columns and scalars in the type system. Not to mention all the code that would break if we made such a drastic change.

cpcloud avatar Jun 02 '25 19:06 cpcloud

I think the distinction between scalars and columns in the type system is good. I think breaking the type system for so many users would be unacceptable, we need something not that breaking.

What about a (Table | ir.Value).as_column() method, similar to the existing .as_table() and .as_scalar() methods? Then in my example above I would do ...address.street1.as_column().first(order_by=quality)

NickCrews avatar Nov 10 '25 16:11 NickCrews