SplitApplyCombine.jl

Allow `group` to take in an `AbstractVector` of groups?

Open pdeffebach opened this issue 5 years ago • 8 comments

Something like

g = [1, 1, 2, 2]
x = [5, 6, 7, 8]
group(g, x)

pdeffebach, Oct 30 '20 19:10

Yes I think this is a good idea, though we need to be careful that dispatch works out.

I also thought we might have had something like this? (Perhaps it’s the internal function).

andyferris, Nov 02 '20 11:11

I don't feel that strongly about it. It was just a surprising omission, because without this there is no exact equivalent of R's `tapply`.

pdeffebach, Nov 02 '20 18:11

Hi @pdeffebach,

I finally got some time at the computer and see we already have this behavior:

julia> g = [1, 1, 2, 2]
4-element Array{Int64,1}:
 1
 1
 2
 2

julia> x = [5, 6, 7, 8]
4-element Array{Int64,1}:
 5
 6
 7
 8

julia> group(g, x)
2-element Dictionaries.Dictionary{Int64,Array{Int64,1}}
 1 │ [5, 6]
 2 │ [7, 8]

Is this what you were expecting?

andyferris, Nov 03 '20 22:11

Regarding R's `tapply`: if you want to apply `fun` to each group, you can do `fun.(group(g, x))`. Sometimes `fun.(groupview(g, x))` might be faster or less memory hungry, and there is always `groupreduce`, like `groupreduce(+, g, x)`.
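
As a minimal sketch of the `tapply`-style pattern described above, the snippet below splits `x` by the labels in `g` and sums each group. It relies only on the `group(g, x)` method shown earlier in this thread and on `map` over the resulting dictionary; the variable name `totals` is illustrative.

```julia
using SplitApplyCombine

g = [1, 1, 2, 2]
x = [5, 6, 7, 8]

# Split x by the labels in g, then apply a function to each group.
# This plays the role of R's tapply(x, g, sum).
totals = map(sum, group(g, x))

totals[1]  # sum of the values labelled 1 (5 and 6)
totals[2]  # sum of the values labelled 2 (7 and 8)
```

The result is keyed by group label, so each summary is looked up by the label rather than by position.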

andyferris, Nov 03 '20 22:11

Thanks for this.

One final question: is there a version of this for transform, i.e. "spread"-ing the result across a vector the same length as the inputs?

I've been doing data cleaning at the REPL, and not having to write out a full `groupby`... `transform` call in DataFrames would be nice.

pdeffebach, Nov 15 '20 19:11

I'm not sure what you are seeking. Is it this?

julia> g = [1, 1, 2, 2]
4-element Array{Int64,1}:
 1
 1
 2
 2

julia> x = [5, 6, 7, 8]
4-element Array{Int64,1}:
 5
 6
 7
 8

julia> groups = group(g, x)
2-element Dictionaries.Dictionary{Int64,Array{Int64,1}}
 1 │ [5, 6]
 2 │ [7, 8]

julia> map(x -> groups[x], g)
4-element Array{Array{Int64,1},1}:
 [5, 6]
 [5, 6]
 [7, 8]
 [7, 8]
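
If the goal is instead a per-group summary spread back over the rows (closer to a DataFrames-style transform), one possible sketch is to reduce each group first and then index the result by `g`. The names `group_means` and `spread` are illustrative, and `mean` is only an example reducer.

```julia
using Statistics, SplitApplyCombine

g = [1, 1, 2, 2]
x = [5, 6, 7, 8]

# One summary value per group, keyed by group label.
group_means = map(mean, group(g, x))

# Look the summary up once per row, so the result has the same length as g and x.
spread = [group_means[k] for k in g]
```

Here `spread` holds each row's group mean, in the original row order.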

andyferris, Nov 17 '20 03:11

Sorry for forgetting about this thread. I think the infrastructure almost has what I want, but I would like this to be in one function (the package is called SplitApplyCombine, after all).

julia> using Statistics, SplitApplyCombine;

julia> function applyby(f, g::AbstractVector, x::AbstractVector)
           groups = group(g, x)
           map(f, groups)
       end
applyby (generic function with 1 method)

julia> applyby(mean, [1, 1, 2, 2], [5, 6, 7, 8])
2-element Dictionaries.Dictionary{Int64, Float64}
 1 │ 5.5
 2 │ 7.5

This would be nice to have. For reference, my motivation is supporting grouped operations inside DataFramesMeta's `@with`, where all columns are just vectors, so we can't take advantage of any DataFrames machinery.

An added bonus would be to allow multiple arguments, i.e. `applyby(f, g, args...)`. I'm not sure how that would work, but it could be feasible.
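
One way the multi-argument version might work is to group the row indices instead of the values, then pass the matching slice of every argument to `f`. This is only a sketch under assumptions: `applyby_multi` and `weighted_mean` are hypothetical names, not part of SplitApplyCombine, and every argument is assumed to share the indices of `g`.

```julia
using SplitApplyCombine

# Hypothetical helper, not part of SplitApplyCombine.
# Assumes every vector in args has the same indices as g.
function applyby_multi(f, g::AbstractVector, args::AbstractVector...)
    # Group the row indices by label, so the same rows can be
    # selected from each argument.
    rows = group(g, collect(eachindex(g)))
    # Call f once per group with a view of each argument's rows.
    map(idx -> f((view(a, idx) for a in args)...), rows)
end

# Example two-argument reducer: a weighted mean of x with weights w.
weighted_mean(x, w) = sum(x .* w) / sum(w)

result = applyby_multi(weighted_mean, [1, 1, 2, 2], [5, 6, 7, 8], [1, 3, 1, 1])
```

Using views avoids copying each argument per group; whether that matters in practice would depend on the data size.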

pdeffebach, May 26 '21 16:05

As a general principle, it seems better to have a few general functions that compose easily (`group` + `map` in your example) than a larger number of specialized functions (`applyby`). I think this case would have (almost) zero overhead if you use `groupview` instead of `group`. Maybe I'm missing something, but

map(mean, group([1, 1, 2, 2], [5, 6, 7, 8]))

already looks very short, intuitive, and clear, once one knows what `map` and `group` do.

aplavin, May 26 '21 17:05