expression functors (in particular: `over`)
pasted from slack:
consider the following type of expr
select(groupby(df, :x4), :x1 => first => :first_in_group)
this is fine in isolation, but is a little unwieldy when embedded in a select list of multiple transformations at once. technically, the below works but is obviously not idiomatic...
select(df,
:x1 => ByRow(sqrt) => :sqrt1,
:x2 => ByRow(log) => :log1,
AsTable([:x4, :x1]) => (
j -> select(groupby(DataFrame(j), :x4),
:x1 => first => :first_in_group
).first_in_group
) => :first_in_group
)
polars has .over which is essentially an expr functor (expr, col) --> expr , so something like this could be expressed as
df.select(
x1=np.sqrt(col("x1")),
x2=np.log(col("x2")),
first_in_group=col("x1").first().over("x4")
)
I'm wondering if such a feature / expr transformation like over (which I use frequently) could be implemented in DataFrames ? (or if it exists and I haven't found it yet)
reply by bkamins:
The key question is if we could add it consistently (i.e. supporting both data frame and grouped data frame as an input, and supporting both select and combine kind of operation)
Yes - my point was about what is your proposal for these cases. In particular:
- how do you expect the feature to work with `GroupedDataFrame` input;
- how do you expect the feature to work with `combine` (where operations could return varying number of rows, e.g. you do `over` that returns 10 rows, but other operations passed to `combine` return e.g. 7 rows that cannot be unambiguously matched to the 10 rows returned by `over`).
(maybe you are not clear what to do - it is OK then, but we must clearly see how to combine the proposal with the whole ecosystem)
I have not thought out all the edge cases yet, but I will try to get a list of examples for these scenarios (and then bikeshed the actual api)
at least in the basic case of GroupedDataFrame I think this can be straightforwardly handled by applying over the subgroups within each group, that is
* select(df, Over(expr, [:A, :B]))
* select(groupby(df, :A), Over(expr, :B))
* select(groupby(df, [:A, :B]), expr)
* select(df, Over(Over(expr, :B), :A)) # maybe? not sure about this one
should all be equal
Note that groupby allows for passing sorting order of groups and even allows the groups to be reordered (which is a hard edge case to think about). But I think it can be "worked out". The combine case is a harder issue.
actually, sorry, could you help me create such an edge case that may be ambiguous? maybe this is a naive answer, but I am thinking that in the same way for select, then combine(groupby(df, [:A, :B]), expr) == combine(groupby(df, :A), Over(expr, :B)) == combine(df, Over(expr, [:A, :B]))
there is already an error when the column lengths from combine don't match, like
julia> combine(df, :x1 => j -> [1,2], :x2 => j -> [1,2,3])
ERROR: ArgumentError: New columns must have the same length as old columns
as an alternate API showerthought, since |> has higher precedence than =>, maybe this could apply to the column selectors and be written like
select(df, :x1 |> Over([:A, :B]) => first => :first_in_group)
etc
Here is a problematic example:
julia> df = DataFrame(id=[1,2,1,2,3], x=1:5)
5×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 1 3
4 │ 2 4
5 │ 3 5
julia> combine(df, :x => collect∘extrema => :free_col, Over(:x => sum, :id))
and now it is unclear what should be done. In separation the expressions produce:
julia> combine(df, :x => collect∘extrema => :free_col)
2×1 DataFrame
Row │ free_col
│ Int64
─────┼──────────
1 │ 1
2 │ 5
julia> combine(groupby(df, :id), :x => sum)
3×2 DataFrame
Row │ id x_sum
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 6
3 │ 3 5
but how to link them?
Note that if :id had 2 groups we would have the same number of rows in both results, but it would still not be clear how the rows should be combined.
Note that things would yet be more complicated if the combine(groupby(df, :id), :x => sum) returned multiple rows per group (and potentially different number of rows per group).
Your examples work and are clear what to do because you work with:
- `select` which keeps all rows and keeps their order always;
- from `Over` you return `1` row per group (so it can be broadcasted without a problem);
And when we combine these two conditions indeed things are clear how they should work.
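A minimal sketch of these two conditions using only the existing API (no `Over` involved), on the same data as the problematic example. It assumes `select` on a `GroupedDataFrame` with `keepkeys=false` is a fair stand-in for what `Over` would do:

```julia
using DataFrames

# Same data as in the problematic example.
df = DataFrame(id=[1, 2, 1, 2, 3], x=1:5)

# select on a grouped data frame: the single value returned per group is
# broadcast to every row of that group, and rows keep the parent's order.
per_row = select(groupby(df, :id), :x => sum; keepkeys=false)

# combine on the same grouping collapses to one row per group instead,
# so there is no longer a row-by-row link to df.
per_group = combine(groupby(df, :id), :x => sum)
```

Here `per_row` has as many rows as `df`, while `per_group` has one row per distinct `:id`, which is where the matching question for `combine` comes from.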
> combine(df, :x => collect∘extrema => :free_col, Over(:x => sum, :id)) and now it is unclear what should be done.
so, I would say this should error for the same reason that this does.
combine(df, :x => collect∘extrema => :free_col, :x => (j -> [sum(j), std(j), mean(j)]))
if :id has 2 groups then I would hcat the resulting x_sum with free_col; that is the group id should be mostly ignored except in determining how to apply the expression, other than that it is "just" a function returning a column that can be applied like any other. if it returns multiple rows per group and/or different number of rows per group, that is ok, as long as the total number of rows returned is equal to the number in the other combine operations.
one consequence, if I understand myself correctly lol, would be
combine(groupby(df, :x), Over(expr, :x))
is a no-op, equivalent to combine(groupby(df, :x), expr)
I think in terms of behavior, this does more or less everything I'm imagining (although obviously the implementation is inefficient and the API is pretty kludgy)
function Over(expr, col)
return df -> select(select(groupby(df, col), expr), Not(col))
end
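A quick check of that sketch against the equivalence claimed earlier for `select`. The `Over` function is copied as-is from above (it is not part of DataFrames.jl), and the data frame and column names are made up for illustration:

```julia
using DataFrames

# The Over sketch from the thread, unchanged (not a DataFrames.jl function).
function Over(expr, col)
    return df -> select(select(groupby(df, col), expr), Not(col))
end

# Made-up data with two grouping columns.
df = DataFrame(A=[1, 1, 2, 2, 1], B=["u", "v", "u", "u", "u"], x=1:5)
expr = :x => first => :first_in_group

# Proposed spelling: the function returned by Over is passed to select.
via_over = select(df, Over(expr, [:A, :B]))

# Existing spelling that it is supposed to match.
via_groupby = select(groupby(df, [:A, :B]), expr; keepkeys=false)
```

Both results keep the original row order of `df` and contain only the `first_in_group` column.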
You opened the discussion with the polars over reference, so I think it is good to have a look there: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.over.html
Note that the design of over is more complex and it provides three mapping strategies (that are polars specific)
I need to think about it more. The mental problem I have is that while I see the need for Over the challenge is to design it in a way that would be composable with the rest of the ecosystem.
Maybe this shouldn't be allowed with combine? That operation is more tricky than select because the meaning of rows isn't defined as clearly. And anyway it seems less useful than with select.
In terms of syntax, the most consistent extension of the current API I can think of would be select(df, :x1 => GroupBy(:x4, first) => :first_in_group).
> [`Over` with] `combine` is more tricky than `select` and seems less useful
totally agreed. in fact, part of the whole reason to want Over is explicitly to map the values back to the original rows --- thinking through semantics with combine was only an attempt to make it consistent across the "whole" ecosystem as suggested by @bkamins , but if it were instead decided to support only select and transform I think that would be reasonable. In fact there is already kind of precedent, in that groupby(::GroupedDataFrame, ::Symbol) already errors
.
Another thing to consider with the API is composition of potentially multiple expr functors (hopefully not too ambitious), with another prominent example being Filter. going back to the polars comparison, they allow expressions like this:
df = pl.DataFrame()
df = df.with_columns(
value=pl.Series(np.random.rand(10)),
ready=pl.Series([bool(x) for x in [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]]),
group=pl.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4]),
)
df.with_columns(
first_ready_in_group=col("value").filter(col("ready")).first().over(col("group"))
)
and btw, the reason I am calling these "functors" is because in fact they really can take arbitrary expressions, like so
filter_expr = col("ready").xor((col("value") * 1000).round().mod(7).cast(bool))
over_expr = pl.when(col("group") == 1).then(1).otherwise(2)
df.with_columns(
first_ready_in_group=col("value").filter(filter_expr).first().over(over_expr)
)
As for functors - I think it will be too hard to add them. Our API is too different. Such things were designed to be chained.
As for the syntax:
select(df, :x1 => GroupBy(:x4, first) => :first_in_group)
would be re-written as:
select(df, df -> select(groupby(df, :x4), :x1 => first => :first_in_group, keepkeys=false))
so maybe it would be clearer to use the following syntax:
select(df, GroupBy(:x4, :x1 => first => :first_in_group))
(as then it would be fully clear that GroupBy is just a replacement for inner groupby with keepkeys=false followed by select)
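For reference, the rewritten form already runs with the current API when the inner call is spelled out by hand. A small sketch, with made-up data and `GroupBy` itself left out since it does not exist:

```julia
using DataFrames

# Made-up data; :x4 is the grouping column, :x1 the value column.
df = DataFrame(x1=[1.0, 4.0, 9.0, 16.0], x4=["a", "b", "a", "b"])

out = select(df,
    :x1 => ByRow(sqrt) => :sqrt1,
    # What GroupBy(:x4, :x1 => first => :first_in_group) would expand to:
    # group, transform within groups, drop the key column, keep row order.
    d -> select(groupby(d, :x4), :x1 => first => :first_in_group; keepkeys=false),
)
```

The grouped transformation sits in the same `select` list as the row-wise one, which is the use case from the top of the thread.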
I still think my preferred syntax, if possible, would be with a pipe, both to avoid an extra layer of function nesting and to avoid the classic problem of having to jump the cursor back to add the Over/GroupBy (as I usually start by writing down the input columns). it could look something like
select(df, :x1 |> Over(:x4) => first => :first_in_group)
possibly as a curried version of
select(df, Over(:x1, :x4) => first => :first_in_group)
read as "transform x1 over x4" and this could take column selectors
select(df, :x1 |> Over([:id_A, :id_B, :id_C]) => first => :first_in_product_group)
I also prefer this over @nalimilan 's proposed syntax since I feel like it is somewhat of a rule that all the columns in the transformation must appear in the first element of the Pair, so when seeing :x1 => GroupBy(:x4, expr) it is at first surprising where :x4 comes from, since I did not pass [:x1, :x4] =>
an attempt to enumerate the questions/decisions to be made:
- how much broadcasting magic is acceptable? I think when either (each group returns a scalar) or (each group returns a vector with length of the group) it is quite clear what to do. but what if in some edge case, some groups return a scalar and others return a vector --- should they be allowed to broadcast back independently?
- `GroupBy` vs `Over` - I am mostly ambivalent but do feel like `Over` reads nicely and is consistent with `polars`. My reasoning for the argument order is to follow the convention that `foo(x, y)` often means `x foo y` like the function `occursin`. also even in regular `groupby`, the grouping column is the second argument
- Where should the call go / what argument order? on an expr `A => B => C`, we have three proposals for `GroupBy(:id, A => B => C)`, `A => GroupBy(:id, B) => C`, and `GroupBy(A, :id) => B => C`
- should `combine` or `GroupedDataFrame` be supported, and to what extent
.
regarding the polars mapping_strategy, I think we can try to emulate only group_to_rows and forget the other two. mapping_strategy="join" could be obtained already with this Over proposal simply by wrapping the resultant vectors in a Ref, and mapping_strategy="explode" is just strange...
.
> As for functors - I think it will be too hard to add them. Our API is too different. Such things were designed to be chained.
ok, fair enough (especially for the general case). but wouldn't this be nice for some more "simple" compositions of Over with Filter ! (or maybe named When to keep the adjective theme going)
select(df,
:value |> Filter(:valid, :recent; operator = ∩) |> Over(:id)
=> timeof ∘ first
=> :group_ready_time,
)
Let us focus on the syntax. I will answer the rest of the issues once this is settled (as this will be a consequence).
Using :x1 |> Over(:x4) is just equivalent to Over(:x4)(:x1), but maybe it is OK for you?
So I understand the proposal is to have:
Over(GROUPING_COLS)(A) => B => C
which is equivalent to:
A |> Over(GROUPING_COLS) => B => C
to be rewritten as:
df -> select(groupby(df, GROUPING_COLS), A => B => C, keepkeys=false)
yes, that looks great IMO. Over(A, GROUPING_COLS) => B => C also seems reasonable if you want to avoid the currying
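A rough sketch of how the curried form could be prototyped outside of DataFrames.jl, just to see the rewrite in action. Everything here is hypothetical: the `Over` and `OverSelection` types and the `lower` helper are names made up for this sketch, and a real implementation would do the lowering inside `select` rather than through an explicit call.

```julia
using DataFrames

# Hypothetical types; neither exists in DataFrames.jl.
struct Over{T}
    cols::T            # grouping columns
end
struct OverSelection{S,T}
    src::S             # source column selector (the A in A => B => C)
    cols::T            # grouping columns
end

# Over(GROUPING_COLS)(A), which is also what A |> Over(GROUPING_COLS) calls.
(o::Over)(src) = OverSelection(src, o.cols)

# Hypothetical lowering step: turn OverSelection => B => C into
# df -> select(groupby(df, GROUPING_COLS), A => B => C, keepkeys=false)
function lower(p::Pair{<:OverSelection})
    sel, rest = p.first, p.second
    return df -> select(groupby(df, sel.cols), sel.src => rest; keepkeys=false)
end

# Made-up data.
df = DataFrame(x1=[10, 20, 30, 40], x4=["a", "b", "a", "b"])

# |> binds tighter than =>, so this is (:x1 |> Over(:x4)) => (first => :first_in_group)
spec = :x1 |> Over(:x4) => first => :first_in_group
out = select(df, :x1, lower(spec))
```

The explicit `lower(...)` call is only there because `select` does not know about `OverSelection`; the point is that the pipe syntax parses as intended and maps onto the rewrite above.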