DataFrames.jl
Combining `pairs(::GroupedDataFrame)`
I often find myself doing something like
combine(groupby(df, :x)) do groupdf
    x = first(groupdf.x)
    # ...
end
It would be nice to be able to do something along the lines of
combine(pairs(groupby(df, :x))) do groupkey, groupdf
    x = groupkey.x
    # ...
end
There is currently no method for `combine` that accepts the output of `pairs(::GroupedDataFrame)`, which is a generator over a `zip` of the group keys and grouped data, a type not particularly conducive to dispatch.
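For reference, here is a minimal sketch of what can already be done by iterating `pairs` by hand. The toy `df` and the column name `ysum` are made up for illustration; this only emulates the requested behavior and is not a proposed implementation.

```julia
using DataFrames

df = DataFrame(x = [1, 1, 2, 2, 2], y = 1:5)   # toy data, not from the issue
gdf = groupby(df, :x)

# pairs(gdf) yields GroupKey => SubDataFrame pairs, so the key is available
# without looking at the rows of the group.
for (groupkey, groupdf) in pairs(gdf)
    println("x = ", groupkey.x, ", rows = ", nrow(groupdf))
end

# Emulate the requested combine(pairs(gdf)) by building one row per group
# and collecting the rows into a DataFrame.
result = DataFrame([(x = groupkey.x, ysum = sum(groupdf.y))
                    for (groupkey, groupdf) in pairs(gdf)])
```

The manual version loses what `combine` does automatically (adding the grouping columns, handling different return types), which is the motivation for asking for a dedicated method.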
I spoke a bit with @bkamins about this in Slack, who noted that:
we would have to go for:
combine(::Function, ::Base.Iterators.Enumerate{<:GroupedDataFrame})
signature and require:
combine(enumerate(gdf)) do idx, sdf ... end
The only downside of this pattern (fortunately not user-visible) is that it would require a completely separate code path.
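To make the `enumerate` variant concrete, here is a rough sketch of how it would look from the user's side. `combine_enumerated` is a hypothetical helper written for this example, not a DataFrames.jl function, and it assumes the callback returns a `NamedTuple`.

```julia
using DataFrames

df = DataFrame(x = ["a", "a", "b"], y = [1, 2, 3])   # toy data
gdf = groupby(df, :x)

# Hypothetical helper (not DataFrames.jl API): calls f(idx, sdf) for each
# group and stacks the NamedTuples it returns into a DataFrame.
function combine_enumerated(f, gdf::GroupedDataFrame)
    return DataFrame([f(idx, sdf) for (idx, sdf) in enumerate(gdf)])
end

out = combine_enumerated(gdf) do idx, sdf
    (group = idx, x = first(sdf.x), ysum = sum(sdf.y))
end
```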
Being able to access the group index would indeed be nice if it is not feasible to access the group key.
Actually I thought you wanted a group number not a group key :smile:.
combine(pairs(groupby(df, :x))) do groupkey, groupdf
    x = groupkey.x
    # ...
end
is also potentially feasible with the signature:
combine(::Function, ::Base.Generator{<:Base.Iterators.Zip{<:Tuple{DataFrames.GroupKeys,GroupedDataFrame}}})
which is not problematic for dispatch.
The questions are:
- do we want to add it at all (it will be a bit complicated to implement, but doable)?
- if yes, do we want it only for `combine`, or also for `select` and `transform`?
- if yes, how flexible should the signature be?
The most general approach would be to allow any signature we currently allow and keep `src => fun => dst` working as it does with `GroupedDataFrame`, changing only what is passed to a function when it is passed without the `=>` syntax. This would probably also be the easiest to implement, as we would just keep a flag recording whether a `GroupedDataFrame`, `pairs` or `enumerate` was passed and feed the appropriate arguments to the transformation function.
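One way to picture the flag idea is the sketch below. Everything in it is hypothetical: the `GroupArgs` enum, `call_on_group` and `combine_sketch` are names invented for this example, the callback is assumed to return a `NamedTuple`, and the real `combine` internals are not shown.

```julia
using DataFrames

# Hypothetical flag recording what the user passed; not DataFrames.jl API.
@enum GroupArgs PlainGroups WithKeys WithIndices

# Feed the transformation function the arguments matching the flag.
function call_on_group(f, mode::GroupArgs, key, idx, sdf)
    mode == WithKeys    && return f(key, sdf)
    mode == WithIndices && return f(idx, sdf)
    return f(sdf)
end

# Single code path for all three cases; only the call to f differs.
function combine_sketch(f, gdf::GroupedDataFrame, mode::GroupArgs)
    rows = [call_on_group(f, mode, key, idx, sdf)
            for (idx, (key, sdf)) in enumerate(pairs(gdf))]
    return DataFrame(rows)
end

df = DataFrame(x = [1, 1, 2], y = [10, 20, 30])   # toy data
gdf = groupby(df, :x)

by_key = combine_sketch(gdf, WithKeys) do key, sdf
    (x = key.x, n = nrow(sdf))
end

by_idx = combine_sketch(gdf, WithIndices) do idx, sdf
    (group = idx, n = nrow(sdf))
end
```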
> which is not problematic for dispatch.
Right, not in practice, but I find dispatching on a generator type a bit fishy given that `Generator` itself isn't exported/necessarily public.
You mean you find `x = groupkey.x` much cleaner than `x = first(groupdf.x)`? :-D If you count the additional `do groupkey`, then it's actually longer!
> do we want to add it at all
This is what I meant by this. We can keep adding features to `combine`/`select`/`transform`, but just have a look at its docstring (try outputting it in a terminal). It is already almost indigestible. That is why I prefer to be careful to add only things that are really needed (or are very simple conceptually).
> You mean you find `x = groupkey.x` much cleaner than `x = first(groupdf.x)`? :-D
Not cleaner so much as clearer; a colleague of mine asked what the `first` was doing when I had been using this pattern, and he suggested `only(unique(groupdf.x))` as a clear and defensive alternative (which is of course slower).
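For anyone comparing the two spellings, here is a small sketch of the defensive form inside `combine`. The data and the output column names `n` and `label` are made up for illustration.

```julia
using DataFrames

df = DataFrame(x = ["a", "a", "b"], y = [1, 2, 3])   # toy data

res = combine(groupby(df, :x)) do groupdf
    # only(...) throws unless the group holds exactly one distinct x,
    # whereas first(groupdf.x) would silently take whatever comes first.
    x = only(unique(groupdf.x))
    (; n = nrow(groupdf), label = uppercase(x))
end
```

Within a group produced by `groupby(df, :x)` the two always agree, so the extra `unique` pass buys readability rather than safety.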
> If you count the additional `do groupkey`, then it's actually longer!
Haha yes indeed. I just find that it's a bit more descriptive as to what it's doing.
> That is why I prefer to be careful to add only things that are really needed (or are very simple conceptually).
100% agree with that. This isn't needed per se, as there are multiple ways to write this already (e.g. `first` and `only`+`unique`), though I do find it conceptually simple. If there's any doubt, I'm happy to close this, I won't fight too hard for it. :slightly_smiling_face:
Let us keep it open as it is appealing, to keep track of it in the future.
> You mean you find `x = groupkey.x` much cleaner than `x = first(groupdf.x)`? :-D If you count the additional `do groupkey`, then it's actually longer!
Following https://github.com/JuliaLang/julia/pull/39285, in Julia 1.7 it could actually be shorter 🙂
combine(pairs(groupby(df, :x))) do (; x), groupdf
    # ...
end
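The `combine(pairs(...))` method above does not exist, but the destructuring part can be tried today on Julia 1.7 or later, since a `GroupKey` supports property access. A minimal sketch, where `summarize` and the toy data are invented for this example:

```julia
using DataFrames   # requires Julia >= 1.7 for property destructuring

df = DataFrame(x = [1, 1, 2], y = [10, 20, 30])   # toy data
gdf = groupby(df, :x)

# Hypothetical callback: the first argument is destructured by property
# name, so x is bound to key.x without naming the key at all.
function summarize((; x), groupdf)
    return (; x, total = sum(groupdf.y))
end

rows = [summarize(key, groupdf) for (key, groupdf) in pairs(gdf)]
```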