Aggregate DataFrame via Summarise Function [HELP Wanted]
Hi, Daru community.
I was looking for a simple function that summarizes a DataFrame using a customisable aggregation function for each new Vector, but I couldn't find a flexible solution.
Sometimes you need to apply different aggregations to different columns. The idea comes from R's dplyr, where you can run summarise on grouped data.
Here is a short example, which I think is mostly self-explanatory. It lets you quickly run different aggregations:
df => #<Daru::DataFrame(8x4)>
         a      b      c      d
  0    foo    one      1     11
  1    bar    one      2     22
  2    foo    two      3     33
  3    bar  three      1     44
  4    foo    two      3     55
  5    bar    two      6     66
  6    foo    one      3     77
  7    foo  three      8     88
# proposed notation
summary = df.group_by(:a).summarise_with(
  avg_d: [:mean, :d],
  sum_c: [:sum, :c],
  avg_of_c: [:mean, :c],
  size_b_with_lambda: ->(grouped) { grouped[:b].size },
  uniq_b_with_proc: proc { |grouped| grouped[:b].uniq.size }
)
# Result
=> #<Daru::DataFrame(2x5)>
            avg_d      sum_c   avg_of_c size_b_wit uniq_b_wit
  bar        44.0          9        3.0          3          3
  foo        52.8         18        3.6          5          3
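For readers who want to check the expected numbers without Daru installed, here is a minimal plain-Ruby sketch of the same grouping and per-group aggregation. It is not Daru code: the rows are ordinary hashes, and `mean` and `summarise_rows` are hypothetical helper names introduced only for this illustration.

```ruby
# Plain-Ruby sketch (no Daru): the same eight rows as the DataFrame above.
rows = [
  { a: 'foo', b: 'one',   c: 1, d: 11 },
  { a: 'bar', b: 'one',   c: 2, d: 22 },
  { a: 'foo', b: 'two',   c: 3, d: 33 },
  { a: 'bar', b: 'three', c: 1, d: 44 },
  { a: 'foo', b: 'two',   c: 3, d: 55 },
  { a: 'bar', b: 'two',   c: 6, d: 66 },
  { a: 'foo', b: 'one',   c: 3, d: 77 },
  { a: 'foo', b: 'three', c: 8, d: 88 }
]

# Hypothetical helper: arithmetic mean as a Float.
def mean(values)
  values.sum.to_f / values.size
end

# Hypothetical helper: group rows by `key` and build one summary hash per group.
def summarise_rows(rows, key)
  grouped = rows.group_by { |row| row[key] }.sort_by { |name, _| name }
  grouped.map do |name, group|
    b = group.map { |row| row[:b] }
    c = group.map { |row| row[:c] }
    d = group.map { |row| row[:d] }
    [name, { avg_d: mean(d), sum_c: c.sum, avg_of_c: mean(c),
             size_b: b.size, uniq_b: b.uniq.size }]
  end.to_h
end

summary = summarise_rows(rows, :a)
summary.each { |name, values| puts "#{name}: #{values}" }
```

The two groups come out in the same order as the proposed result (bar, then foo), with the same five aggregated values per group.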
I have also implemented this in the piece of code below, but I'm not sure whether such a function already exists somewhere in Daru.
class Daru::Core::GroupBy
  def summarise_with(**aggregations)
    # One result row (a hash of new_vector => value) per group
    super_hash = groups.map { |n, _| [n, {}] }.to_h
    groups.keys.each do |group_name|
      group_data = get_group(group_name)
      aggregations.each do |new_vector, opts|
        # opts is either [method_name, vector] or a callable
        aggregation, vector = Array(opts)
        # Aggregate the named vector if there is one, otherwise the whole group
        to_aggregate = group_data.has_vector?(vector) ? group_data[vector] : group_data
        super_hash[group_name][new_vector] = if aggregation.is_a?(Proc)
                                               aggregation.call(to_aggregate)
                                             else
                                               to_aggregate.send(aggregation)
                                             end
      end
    end
    Daru::DataFrame.new(super_hash.values, index: super_hash.keys)
  end
end
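The part of this that may not be obvious is how `Array(opts)` lets one code path accept both the `[:mean, :d]` form and a bare lambda or proc. Below is a small plain-Ruby sketch of just that dispatch, with a Hash of arrays standing in for the group; `apply_spec` is a hypothetical name, not part of Daru.

```ruby
# Hypothetical helper mirroring the dispatch in summarise_with, without Daru.
# `group` is a Hash of column name => Array of values.
def apply_spec(opts, group)
  # [:sum, :c] destructures into (:sum, :c); a Proc is wrapped as [proc],
  # so `column` is nil in that case.
  aggregation, column = Array(opts)
  target = group.key?(column) ? group[column] : group
  if aggregation.is_a?(Proc)
    aggregation.call(target)
  else
    target.public_send(aggregation)
  end
end

group = { b: %w[one three two], c: [2, 1, 6] }

sum_c  = apply_spec([:sum, :c], group)                   # symbol applied to one column
uniq_b = apply_spec(->(g) { g[:b].uniq.size }, group)    # callable given the whole group

puts sum_c   # 9
puts uniq_b  # 3
```

`Kernel#Array` returns a one-element array for objects that respond to neither `to_ary` nor `to_a`, which is why the callable form falls through to receiving the whole group.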
@vanitu agreed
Hi,
I have a Dataset class that wraps Daru and delegates to it much of the time. This allows me to create business-specific transforms that would not be suitable for open source. I wrote the following summarize transform. If folks like the API, I could try to turn it into an enhancement.
wafer_median_isats = dset.summarize(/isat/, group_by: [:lot_id, :waf_num], stats: :median)
The first argument is the columns to summarize, the group_by argument does just that and the stats argument can be a single stat or an array of stats (e.g. [:median, :mean]). The summarize method is as follows:
summarized_hash = Hash.new { |h, k| h[k] = [] }.tap do |sum_dset|
  options[:stats].each do |statistic|
    data_frame.group_by(group_columns).each_group do |dframe|
      unless dframe[0].respond_to? statistic
        puts "Cannot summarize by stat '#{statistic}'!"
        fail
      end
      dframe.each_vector_with_index do |vec, col_name|
        if group_columns.include? col_name
          sum_dset[col_name] << vec[0]
        elsif summarize_columns.include? col_name
          sum_dset["#{col_name}_#{statistic}".to_sym] << vec.send(statistic).to_f.round(4)
        end
      end
    end
  end
end
I then just instantiate a new DataFrame using the hash of arrays 'summarized_hash'. Does this look like the most efficient way to create a statistical summary?
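To make the shape of that hash of arrays concrete, here is a plain-Ruby sketch of the same idea that runs without Daru. Everything in it is an assumption for illustration: `stat_functions` and `summarize_rows` are hypothetical helpers, and the `isat_n` / `isat_p` columns and their values are made-up sample data.

```ruby
# Hypothetical lookup of supported statistics (not Daru's own implementations).
def stat_functions
  {
    mean: ->(v) { v.sum.to_f / v.size },
    median: lambda do |v|
      sorted = v.sort
      mid = sorted.size / 2
      sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
    end
  }
end

# Hypothetical plain-Ruby analogue of the summarize transform.
# Returns a Hash of column name => Array with one entry per group.
def summarize_rows(rows, pattern, group_by:, stats:)
  summary = Hash.new { |hash, key| hash[key] = [] }
  columns = rows.first.keys.select { |col| col.to_s.match?(pattern) }
  rows.group_by { |row| row.values_at(*group_by) }.each_value do |group|
    group_by.each { |col| summary[col] << group.first[col] }
    columns.each do |col|
      values = group.map { |row| row[col] }
      Array(stats).each do |stat|
        # fetch raises KeyError for an unsupported statistic
        summary[:"#{col}_#{stat}"] << stat_functions.fetch(stat).call(values).round(4)
      end
    end
  end
  summary
end

rows = [
  { lot_id: 'A', waf_num: 1, isat_n: 1.0, isat_p: 2.0 },
  { lot_id: 'A', waf_num: 1, isat_n: 3.0, isat_p: 4.0 },
  { lot_id: 'A', waf_num: 1, isat_n: 8.0, isat_p: 6.0 },
  { lot_id: 'A', waf_num: 2, isat_n: 5.0, isat_p: 7.0 }
]

result = summarize_rows(rows, /isat/, group_by: %i[lot_id waf_num], stats: %i[median mean])
result.each { |column, values| puts "#{column}: #{values}" }
```

This sketch groups once and loops over the statistics inside each group, so the group-key columns receive exactly one entry per group however many statistics are requested, and every array in the hash ends up the same length.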
@vanitu I believe you should use DataFrame#aggregate
df.group_by(:a).aggregate(
  avg_d: ->(df) { df[:d].mean },
  sum_c: ->(df) { df[:c].sum },
  avg_of_c: ->(df) { df[:c].mean },
  size_b_with_lambda: ->(grouped) { grouped[:b].size },
  uniq_b_with_proc: proc { |grouped| grouped[:b].uniq.size }
)
=> #<Daru::DataFrame(2x5)>
            avg_d      sum_c   avg_of_c size_b_wit uniq_b_wit
  bar        44.0          9        3.0          3          3
  foo        52.8         18        3.6          5          3