ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat(api): Add `order_by` parameter to the `collect()` method

Open chelsea-lin opened this issue 1 year ago • 3 comments

Is your feature request related to a problem?

The collect() method in Ibis is a convenient way to aggregate data into arrays when working with BigQuery backends. It translates to the ARRAY_AGG function in BigQuery SQL. However, there's currently no built-in way to specify an ORDER BY clause within the ARRAY_AGG aggregation. This means that the order of elements in the resulting array is not guaranteed, which can be problematic when the order of elements matters for the downstream analysis.

What is the motivation behind your request?

No response

Describe the solution you'd like

I propose adding an order_by parameter to the collect() method. This parameter would accept either:

  • A string or list of strings representing column names to order by.
  • An Ibis expression indicating the ordering.

This would allow users to explicitly control the ordering of elements within the aggregated array, making the results more predictable and useful.

What version of ibis are you running?

9.0.0

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

chelsea-lin avatar May 10 '24 18:05 chelsea-lin

Thanks for opening this! I agree that supporting ordering in collect would be useful (we're also adding an order_by kwarg to Table.distinct for similar reasons).

I'm not sure how well an order_by kwarg would work though in the existing collect() method though - given that collect is on a Column instead of a Table, having the ordering refer to columns on the parent table may feel weird.

t = ibis.table({"a": "int", "b": "int, "c": "int"})

t.a.collect(order_by=("b", "c"))  # here `b` and `c` refer to columns on the parent table of `a`
t.a.collect(order_by=(t.b, t.c))   # explicit column references also work

Typing all that out doesn't feel that odd, but AFAIK we don't have any other columnar operations that support an ordering specification, so we don't have an existing pattern to mirror here.

Edit: actually, we do have some precedent for binding args to a column method to the parent table - the where arg to all column reductions. This currently doesn't accept string arguments, only deferred operators, but given that then I think the spelling here is a logical extension.

jcrist avatar May 10 '24 18:05 jcrist

group_concat is another operation where this would make sense, along with the first and last aggregate functions. I think we should add them on a case by case basis for now and do an audit of the aggs where ordering has an effect on the output.

cpcloud avatar May 10 '24 19:05 cpcloud

A quick survey of backends and current ops shows those are the main 4 operations. I've taken a quick look through implementing this (needed a fun friday break) and like most things ibis it's turned up some internal cleanups we'll want to do first but I think this should be only a medium-sized lift to implement.

jcrist avatar May 10 '24 19:05 jcrist