feat: better api for creating aggregates over scalars
Is your feature request related to a problem?
In SQL, you can create aggregates over scalars. However, in ibis, the ir.Scalar Values don't have any of the aggregation functions:
import ibis
ibis.options.interactive = True
# aggregating over constants works
sql = """
WITH t as (
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 1 AS a, 7 AS b UNION ALL
SELECT 2 AS a, 3 AS b
)
SELECT AVG(b) AS avg_b, AVG(NULL) as avg_null, AVG(43) AS avg_const
FROM t
GROUP BY a
"""
from_sql = ibis.duckdb.connect().sql(sql)
print(from_sql.execute())
# works
# You don't even need to perform the aggregation over a table/group/etc
sql = """
SELECT AVG(NULL) as avg_null, AVG(43) AS avg_const
"""
from_sql = ibis.duckdb.connect().sql(sql)
print(from_sql.execute())
# works
t = ibis.memtable({"a": [1, 1, 2], "b": [5, 7, 3]})
t.group_by("a").aggregate(
avg_b=t.b.mean(),
avg_null=ibis.null(float).mean(), # AttributeError: 'FloatingScalar' object has no attribute 'mean'
avg_const=ibis.literal(43).mean(), # AttributeError: 'IntegerScalar' object has no attribute 'mean'
)
What is the motivation behind your request?
I have some dataclass like
@dataclass
class Address:
street1: ir.StringValue
street2: ir.StringValue
city: ir.StringValue
...
which I use as an abstraction layer to smooth over the differences in schemas that my vaiours tables have. Then I have some factory method, where the street2 field is optional. If not given, it infers NULL:
def make_address(t: ibis.Table):
if "street2" is not in t.columns
street2 = ibis.null(str)
else:
street2 = t.street2
....
return Address(...)
Then I want to be able to use these as
def canonicalize_addresses(t: ibis.Table):
address = make_address(t)
quality = rank_quality(address)
aggs = {
street1: address.street1.first(order_by=quality),
street2: address.street2.first(order_by=quality),
....,
},
return t.group_by("address_id").agg(**aggs)
In other words, once I get the Address dataclass back, I want to be able to treat these as values from the relation. This abstraction works great except for when I have to perform aggregations like in the above.
Describe the solution you'd like
Several options
- Don't change anything. I will need to add in an extra .select() somehwere in all using code so that the
ibis.null()is turned from a ir.Scalar to to a ir.Column. - Add a
as_column()method to Scalars, Columns, and Tables, similar to the existing .as_scalar() and .as_table() methods. Then I will have to explicitly call this function. - Add aggregation methods to ir.Scalar objects. IDK if we should make it so that
ibis.literal(42).mean().execute()gives an error (similar to current behavior) or perhaps we should do what SQL does, and have this result in 42.
What version of ibis are you running?
main
What backend(s) are you using, if any?
duckdb, but I think this should affect all SQL backends the same. I think duckdbs handling of SELECT AVG(42) should be the same in all backends?
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hm, yeah.
One weird thing here is that adding aggregate methods to scalars sort of obviates the need for scalars and columns, since they now just become special cases of tables. I'm not saying we should do that, but it does feel weird to add aggregate methods to scalars and then also keep the distinction of columns and scalars in the type system. Not to mention all the code that would break if we made such a drastic change.
I think the distinction between scalars and columns in the type system is good. I think breaking the type system for so many users would be unacceptable, we need something not that breaking.
What about a (Table | ir.Value).as_column() method, similar to the existing .as_table() and .as_scalar() methods? Then in my example above I would do ...address.street1.as_column().first(order_by=quality)