DataFrames.jl expression functors (in particular: `over`)

pasted from slack:

consider the following type of expr

select(groupby(df, :x4), :x1 => first => :first_in_group)

this is fine in isolation, but is a little unwieldy when embedded in a select list of multiple transformations at once. technically, the below works but is obviously not idiomatic...

select(df,
    :x1 => ByRow(sqrt) => :sqrt1,
    :x2 => ByRow(log) => :log1,
    AsTable([:x4, :x1]) => (
        j -> select(groupby(DataFrame(j), :x4),
            :x1 => first => :first_in_group
        ).first_in_group
    ) => :first_in_group
)

polars has .over which is essentially an expr functor (expr, col) --> expr , so something like this could be expressed as

df.select(
    x1=np.sqrt(col("x1")),
    x2=np.log(col("x2")),
    first_in_group=col("x1").first().over("x4")
)

I'm wondering if such a feature / expr transformation like over (which I use frequently) could be implemented in DataFrames ? (or if it exists and I haven't found it yet)

reply by bkamins:

The key question is if we could add it consistently (i.e. supporting both data frame and grouped data frame as an input, and supporting both select and combine kind of operation)

Sep 06 '23 18:09 adienes

Yes - my point was about what is your proposal for these cases. In particular:

how do you expect the feature to work with GroupedDataFrame input;
how do you expect the feature to work with combine (where operations could return varying number of rows, e.g. you do over that returns 10 rows, but other operations passed to combine return e.g. 7 rows that cannot be unambiguously matched to the 10 rows retuned by over).

(maybe you are not clear what to do - it is OK then, but we must clearly see how to combine the proposal with the whole ecosystem)

Sep 06 '23 18:09 bkamins

I have not thought out all the edge cases yet, but I will try to get a list of examples for these scenarios (and then bikeshed the actual api)

at least in the basic case of GroupedDataFrame I think this can be straightforwardly handled by applying over the subgroups within each group, that is

* select(df, Over(expr, [:A, :B]))
* select(groupby(df, :A), Over(expr, :B))
* select(groupby(df, [:A, :B]), expr)
* select(df, Over(Over(expr, :B), :A)) # maybe? not sure about this one

should all be equal

Sep 06 '23 18:09 adienes

Note that groupby allows for passing sorting order of groups and even allows the groups to be reordered (which is a hard edge case to think about). But I think it can be "worked out". The combin case is a harder issue.

Sep 06 '23 19:09 bkamins

actually, sorry, could you help me create such an edge case that may be ambiguous? maybe this is a naive answer, but I am thinking that in the same way for select, then combine(groupby(df, [:A, :B]), expr) == combine(groupby(df, :A), Over(expr, :B)) == combine(df, Over(expr, [:A, :B]))

there is already an error when the column lengths from combine don't match, like

julia> combine(df, :x1 => j -> [1,2], :x2 => j -> [1,2,3])
ERROR: ArgumentError: New columns must have the same length as old columns

as an alternate API showerthought, since |> has higher precedence than =>, maybe this could apply to the column selectors and be written like

select(df, :x1 |> Over([:A, :B]) => first => :first_in_group)

etc

Sep 06 '23 20:09 adienes

Here is a problematic example:

julia> df = DataFrame(id=[1,2,1,2,3], x=1:5)
5×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     3      5

julia> combine(df, :x => collect∘extrema => :free_col, Over(:x => sum, :id))

and now it is unclear what should be done. In separation the expressions produce:

julia> combine(df, :x => collect∘extrema => :free_col)
2×1 DataFrame
 Row │ free_col
     │ Int64
─────┼──────────
   1 │        1
   2 │        5

julia> combine(groupby(df, :id), :x => sum)
3×2 DataFrame
 Row │ id     x_sum
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      6
   3 │     3      5

but how to link them?

Note that if :id had 2 groups we would have the same number of rows in both results, but it would still not be clear how the rows should be combined.

Note that things would yet be more complicated if the combine(groupby(df, :id), :x => sum) returned multiple rows per group (and potentially different number of rows per group).

Your examples work and are clear what to do because you work with:

select which keeps all rows and keeps their order always;
from Over you return 1 row per group (so it can be broadcasted without a problem);

And when we combine these two conditions indeed things are clear how they should work.

Sep 06 '23 20:09 bkamins

combine(df, :x => collect∘extrema => :free_col, Over(:x => sum, :id)) and now it is unclear what should be done.

so, I would say this should error for the same reason that this does.

combine(df, :x => collect∘extrema => :free_col, :x => (j -> [sum(j), std(j), mean(j)]))

if :id has 2 groups then I would hcat the resulting x_sum with free_col; that is the group id should be mostly ignored except in determining how to apply the expression, other than that it is "just" a function returning a column that can be applied like any other. if it returns multiple rows per group and/or different number of rows per group, that is ok, as long as the total number of rows returned is equal to the number in the other combine operations.

one consequence, if I understand myself correctly lol, would be

combine(groupby(df, :x), Over(expr, :x))

is a no-op, equivalent to combine(groupby(df, :x), expr))

Sep 06 '23 21:09 adienes

I think in terms of behavior, this does more or less everything I'm imagining (although obviously the implementation is inefficient and the API is pretty kludgy)

function Over(expr, col)
    return df -> select(select(groupby(df, col), expr), Not(col))
end

Sep 07 '23 01:09 adienes

You opened the discussion with the polars over refernce, so I think it is good to have a look there: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.over.html

Note that the design of over is more complex and it provides three mapping strategies (that are polars specific)

I need to think about it more. The mental problem I have is that while I see the need for Over the challenge is to design it in a way that would be composable with the rest of the ecosystem.

Sep 07 '23 07:09 bkamins

Maybe this shouldn't be allowed with combine? That operation is more tricky than select because the meaning of rows isn't defined as clearly. And anyway it seems less useful than with select.

In terms of syntax, the most consistent extension of the current API I can think of would be select(df, :x1 => GroupBy(:x4, first) => :first_in_group).

Sep 10 '23 14:09 nalimilan

[Over with] combine is more tricky than select and seems less useful

totally agreed. in fact, part of the whole reason to want Over is explicitly to map the values back to the original rows --- thinking through semantics with combine was only an attempt to make it consistent across the "whole" ecosystem as suggested by @bkamins , but if it were instead decided to support only select and transform I think that would be reasonable. In fact there is already kind of precedent, in that groupby(::GroupedDataFrame, ::Symbol) already errors

.

Another thing to consider with the API is composition of potentially multiple expr functors (hopefully not too ambitious), with another prominent example being Filter. going back to the polars comparison, they allow expressions like this:

df = pl.DataFrame()
df = df.with_columns(
    value=pl.Series(np.random.rand(10)),
    ready=pl.Series([bool(x) for x in [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]]),
    group=pl.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4]),
)
df.with_columns(
    first_ready_in_group=col("value").filter(col("ready")).first().over(col("group"))
)

and btw, the reason I am calling these "functors" is because in fact they really can take arbitrary expressions, like so

filter_expr = col("ready").xor((col("value") * 1000).round().mod(7).cast(bool))
over_expr = pl.when(col("group") == 1).then(1).otherwise(2)

df.with_columns(
    first_ready_in_group=col("value").filter(filter_expr).first().over(over_expr)
)

Sep 10 '23 18:09 adienes

As for functors - I think it will be too hard to add them. Our API is too different. Such things were designed to be chained.

As for the syntax:

select(df, :x1 => GroupBy(:x4, first) => :first_in_group)

would be re-written as:

select(df, df -> select(groupby(df, :x), :x1 => first => :first_in_group, keepkeys=false))

so maybe it would be clearer to use the following syntax:

select(df, GroupBy(:x4, :x1 => first  => :first_in_group))

(as then it would be fully clear that GroupBy is just a replacement for inner groupby with keepkeys=false followed by select)

Sep 10 '23 18:09 bkamins

I still think my preferred syntax, if possible, would be with a pipe, both to avoid an extra layer of function nesting and to avoid the classic problem of having to jump the cursor back to add the Over/GroupBy (as I usually start by writing down the input columns). it could look something like

select(df, :x1 |> Over(:x4) => first => :first_in_group

possibly as a curried version of

select(df, Over(:x1, :x4) => first => :first_in_group

read as "transform x1 over x4" and this could take column selectors

select(df, :x1 |> Over([:id_A, :id_B, id_C]) => first => :first_in_product_group)

I also prefer this over @nalimilan 's proposed syntax since I feel like it is somewhat of a rule that all the columns in the transformation must appear in the first of the Pair, so seeing :x1 => Groupby(:x4, expr) is at first surprising where :x4 comes from since I did not pass [:x1, :x4] =>

an attempt to enumerate the questions/decisions to be made:

how much broadcasting magic is acceptable? I think when either (each group returns a scalar) or (each group returns a vector with length of the group) it is quite clear what to do. but what if in some edge case, some groups return a scalar and others return a vector --- should they be allowed to broadcast back independently?
GroupBy vs Over - I am mostly ambivalent but do feel like Over reads nicely and is consistent with polars. My reasoning for the argument order is to follow the convention that foo(x, y) often means x foo y like the function occursin. also even in regular groupby, the grouping column is the second argument
Where should the call go / what argument order? on an expr A => B => C, we have three proposals for GroupBy(:id, A => B => C), A => GroupBy(:id, B) => C, and GroupBy(A, :id) => B => C
should combine or GroupedDataFrame be supported, and to what extent

. regarding the polars mapping_strategy, I think we can try to emulate only group_to_rows and forget the other two. mapping_strategy="join" could be obtained already with this Over proposal simply via a wrapping the resultant vectors in a Ref, and mapping_strategy="explode" is just strange...

.

As for functors - I think it will be too hard to add them. Our API is too different. Such things were designed to be chained.

ok, fair enough (especially for the general case). but wouldn't this be nice for some more "simple" compositions of Over with Filter ! (or maybe named When to keep the adjective theme going)

select(df,
    :value |> Filter(:valid, :recent; operator = ∩) |> Over(:id)
    => timeof ∘ first
    => :group_ready_time,
)

Sep 11 '23 03:09 adienes

Let us focus on the syntax. I will answer the rest of the issues once this is settled (as this will be a consequence).

Using :x1 |> Over(:x4) is just equivalent to Over(:x4)(:x1), but maybe it is OK for you?

So I understand the proposal is to have:

Over(GROUPING_COLS)(A) => B => C

which is equivalent to:

A |> Over(GROUPING_COLS) => B => C

to be rewritten as:

df -> select(groupby(df, GROUPING_COLS), A => B => C, keepkeys=false)

Sep 11 '23 19:09 bkamins

yes, that looks great IMO. Over(A, GROUPING_COLS) => B => C. also seems reasonable if you want to avoid the currying

Sep 11 '23 19:09 adienes