DataFrames.jl

Combining `pairs(::GroupedDataFrame)`

Open ararslan opened this issue 4 years ago • 7 comments

I often find myself doing something like

combine(groupby(df, x)) do groupdf
    x = first(groupdf.x)
    # ...
end

It would be nice to be able to do something along the lines of

combine(pairs(groupby(df, x))) do groupkey, groupdf
    x = groupkey.x
    # ...
end

There is currently no such method for combine that would accept the output of pairs(::GroupedDataFrame), which is a generator over a zip of the group keys and grouped data, a type not particularly conducive to dispatch.

I spoke a bit with @bkamins about this in Slack, who noted that:

we would have to go for:

combine(::Function, ::Base.Iterators.Enumerate{<:GroupedDataFrame})

signature and require:

combine(enumerate(gdf)) do idx, sdf
    ...
end

The only downside of this pattern (fortunately not user-visible) is that it would require a completely separate code path.

Being able to access the group index would indeed be nice if it is not feasible to access the group key.

ararslan • Jan 06 '21 23:01

Actually, I thought you wanted a group number, not a group key 😄.

combine(pairs(groupby(df, x))) do groupkey, groupdf
    x = groupkey.x
    # ...
end

is also potentially feasible with the signature:

combine(::Function, ::Base.Generator{<:Base.Iterators.Zip{<:Tuple{DataFrames.GroupKeys,GroupedDataFrame}}})

which is not problematic for dispatch.

The questions are:

  1. Do we want to add it at all? (It would be a bit complicated to implement, but doable.)
  2. If yes, do we want it only for combine, or also for select and transform?
  3. If yes, how flexible should the signature be?

The most general approach would be to allow every signature we currently allow and keep src => fun => dst working as it does with GroupedDataFrame, changing only what is passed to a function that is given without the => syntax. This would probably also be the easiest to implement: we would just keep a flag recording whether a GroupedDataFrame, pairs, or enumerate was passed and feed the appropriate arguments to the transformation function.

bkamins • Jan 06 '21 23:01

which is not problematic for dispatch.

Right, not in practice, but I find dispatching on a generator type a bit fishy given that Generator itself isn't exported/necessarily public.

ararslan • Jan 07 '21 00:01

You mean you find x = groupkey.x much cleaner than x = first(groupdf.x)? :-D If you count the additional do groupkey, then it's actually longer!

nalimilan • Jan 07 '21 08:01

do we want to add it at all

This is what I meant by this. We can keep adding features to combine/select/transform, but just have a look at its docstring (try outputting it in a terminal). It is already almost indigestible. That is why I prefer to be careful to add only things that are really needed (or are very simple conceptually).

bkamins • Jan 07 '21 08:01

You mean you find x = groupkey.x much cleaner than x = first(groupdf.x)? :-D

Not cleaner so much as clearer; a colleague of mine asked what the first was doing when I had been using this pattern, and he suggested only(unique(groupdf.x)) as a clear and defensive alternative (which is of course slower).

If you count the additional do groupkey, then it's actually longer!

Haha yes indeed. I just find that it's a bit more descriptive as to what it's doing.

That is why I prefer to be careful to add only things that are really needed (or are very simple conceptually).

100% agree with that. This isn't needed per se, as there are multiple ways to write this already (e.g. first and only+unique), though I do find it conceptually simple. If there's any doubt, I'm happy to close this; I won't fight too hard for it. 🙂

ararslan • Jan 07 '21 16:01

Let us keep it open, as it is appealing, so that we can keep track of it for the future.

bkamins • Jan 07 '21 17:01

You mean you find x = groupkey.x much cleaner than x = first(groupdf.x)? :-D If you count the additional do groupkey, then it's actually longer!

Following https://github.com/JuliaLang/julia/pull/39285, in Julia 1.7 it could actually be shorter 🙂

combine(pairs(groupby(df, x))) do (; x), groupdf
    # ...
end

knuesel • May 17 '21 09:05