Make `describe` work for a grouped dataframe
Reporting this to remind myself to make this PR.
Since GroupedDataFraemes seem to support broadcasting, this should be easy.
df = DataFrame(rand(5,5))
df[:gender] = ["m", "m", "m", "f", "f"]
gd = groupby(df, :gender)
I don't get how broadcasting works with grouped data frames. I was under the impression that describe would just work with a broadcast.
println.(gd) # works as expected
describe.(gd)
> ERROR: MethodError: no method matching describe(::DataFrames.GroupedDataFrame)
but describe(gd[1]) works, since gd[1] is a subdataframe.
Using by is a bit weird too
> by(df, :gender, describe)
│ Row │ gender │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
├─────┼────────┼──────────┼──────────┼───────────┼───────────┼──────────┼─────────┼──────────┼─────────┤
│ 1 │ m │ x1 │ 0.334892 │ 0.0531118 │ 0.0600073 │ 0.891557 │ nothing │ nothing │ Float64 │
│ 2 │ m │ x2 │ 0.484557 │ 0.351354 │ 0.424645 │ 0.677673 │ nothing │ nothing │ Float64 │
│ 3 │ m │ x3 │ 0.561228 │ 0.0260729 │ 0.775023 │ 0.882588 │ nothing │ nothing │ Float64 │
│ 4 │ m │ x4 │ 0.41416 │ 0.0112837 │ 0.605188 │ 0.626008 │ nothing │ nothing │ Float64 │
│ 5 │ m │ x5 │ 0.596377 │ 0.454716 │ 0.573004 │ 0.761409 │ nothing │ nothing │ Float64 │
│ 6 │ m │ gender │ nothing │ m │ nothing │ m │ 1 │ nothing │ String │
│ 7 │ f │ x1 │ 0.328554 │ 0.266763 │ 0.328554 │ 0.390345 │ nothing │ nothing │ Float64 │
│ 8 │ f │ x2 │ 0.712208 │ 0.524521 │ 0.712208 │ 0.899895 │ nothing │ nothing │ Float64 │
│ 9 │ f │ x3 │ 0.305259 │ 0.253505 │ 0.305259 │ 0.357012 │ nothing │ nothing │ Float64 │
│ 10 │ f │ x4 │ 0.614933 │ 0.381118 │ 0.614933 │ 0.848749 │ nothing │ nothing │ Float64 │
│ 11 │ f │ x5 │ 0.206904 │ 0.145303 │ 0.206904 │ 0.268505 │ nothing │ nothing │ Float64 │
│ 12 │ f │ gender │ nothing │ f │ nothing │ f │ 1 │ nothing │ String │
map(describe, df) works fine, but for some reason it does some extra printing of each subdataframe. not sure why it does this.
That's because map returns a GroupApplied object, which can then be combined back to a single data frame. The result of by(df, :gender, describe) is indeed a bit weird, because the grouping variable is passed to the function. I think there have been repeated complaints about this, maybe it should be optional. Not sure which default is best, we should compare with dplyr and Pandas.
R
The only solution I could find for R is to use a do command. Maybe this is better with purrr but I've actually never taken the time to learn purrr.
Simply calling summarize on a grouped dataframe doesn't alter behavior.
t = df %>% group_by(Species) %>% do(summary = summary(.))
> Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
Species summary
* <fct> <list>
1 setosa <S3: table>
2 versicolor <S3: table>
3 virginica <S3: table>
t$summary
> [[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
[[3]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
Man R loves lists.
Pandas
Pandas is weird. iris.describe() looks nice:
>>> iris.describe()
Sepal.Length Sepal.Width Petal.Length Petal.Width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
But when you group it, the output becomes unreadable. It prints each variable horizontally.
t = iris.groupby('Species')
>>> t.describe()
Petal.Length Petal.Width ... Sepal.Length Sepal.Width
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Species ...
setosa 50.0 1.462 0.173664 1.0 1.4 1.50 1.575 1.9 50.0 0.246 ... 5.2 5.8 50.0 3.428 0.379064 2.3 3.200 3.4 3.675 4.4
versicolor 50.0 4.260 0.469911 3.0 4.0 4.35 4.600 5.1 50.0 1.326 ... 6.3 7.0 50.0 2.770 0.313798 2.0 2.525 2.8 3.000 3.4
virginica 50.0 5.552 0.551895 4.5 5.1 5.55 5.875 6.9 50.0 2.026 ... 6.9 7.9 50.0 2.974 0.322497 2.2 2.800 3.0 3.175 3.8
Thoughts
Returning a grouped dataframe:
│ Row │ gender |variable │ mean │
├─────┼────────|─────────┼──────────┤
│ 1 │ M |x1 │ 0.456352 │
│ 2 │ M |x2 │ 0.441564 │
│ 3 │ M |x3 │ 0.448726 │
│ 4 │ M |x4 │ 0.456411 │
│ 5 │ M |x5 │ 0.522709 │
│ 6 │ M |sex │ ? │
│ Row │ gender |variable │ mean │
├─────┼────────|─────────┼──────────┤
│ 1 │ F |x1 │ 0.456352 │
│ 2 │ F |x2 │ 0.441564 │
│ 3 │ F |x3 │ 0.448726 │
│ 4 │ F |x4 │ 0.456411 │
│ 5 │ F |x5 │ 0.522709 │
│ 6 │ F |sex │ ? │
This is weird because the :gender column's value isn't a property of the column :x1 so grouping the summary statistics doesn't really make sense.
We could just print 3 DataFrames with the annotation where :gender == "M" or something. But then we have different outputs for describe since describe returns a DataFrame while this new behavior would return Nothing.
Thanks for checking! dplyr is indeed a bit different from us, since the grouping information can be present without necessarily changing the behavior of many operations. Pandas' behavior makes sense, but the printing is terrible.
How about returning a GroupedDataFrame with the same columns as the standard describe, plus the grouping column? Basically something similar to what by(df, :gender, describe) currently returns, but without the weird additional row for the grouping column.
This is a good idea. And something I will work to implement.
I think that there are some changes I want to make before this, since grouped operations can be expensive, see my comment in #1256.
Without trying to re-optimize grouped operations just for this, it's worth changing summarize so that it only does the calculations it needs to, rather than calculating everything and just the statistics specified in stats.
I'm going to implement this:
function describe(g::GroupedDataFrame; kwargs...)
des = combine(map(d -> describe(d, kwargs...), g)) # a dataframe
groupby(des, g.cols) # group it again
end
Though it isn't much use if Grouped DataFrames don't print more information. They currently cutoff pretty quickly. However I am under the impression we will be changing grouped dataframes printing in the future.
This behavior also might change with #1520
Thanks. Actually, I wonder whether we shouldn't make this the standard behavior of map on GroupedDataFrame. At #1520 I suggested replacing combine with map, but it would also make sense to keep them as different operations, with map preserving the grouping information. Then instead of describe(gd), one would call map(describe, gd).
I think thats the right move for sure. Two things.
- Does 1520 change the awkward printing of each subdataframe?
- Would there be any way to print something like
Group 1: :country == "United States" & :sex == "Female"
as the default behavior for any map operation?
- No, not at all. But can you develop what you find awkward in particular?
- Yes, that sounds like a good idea. Feel free to experiment with that (the text should probably be cut if it's too long to fit on screen).
Regarding this issue, I'll update #1520 to keep combine, and we can come back to it after that.
This is the current output of map(describe, g::GroupedDataFrame)
julia> map(describe, g)
GroupApplied{DataFrame}(GroupedDataFrame with 3 groups based on keys: :A, :B
First Group: 2 rows
│ Row │ a │ b │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.121894 │
│ 2 │ 1 │ 0.427624 │
⋮
Last Group: 2 rows
│ Row │ a │ b │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 3 │ 0.280238 │
│ 2 │ 3 │ 0.746962 │, DataFrame[2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean │ min │ median │ max │ nunique │
│ │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┤
│ 1 │ a │ 1.0 │ 1 │ 1.0 │ 1 │ │
│ 2 │ b │ 0.274759 │ 0.121894 │ 0.274759 │ 0.427624 │ │, 2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean │ min │ median │ max │ nunique │
│ │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┤
│ 1 │ a │ 2.0 │ 2 │ 2.0 │ 2 │ │
│ 2 │ b │ 0.401019 │ 0.199386 │ 0.401019 │ 0.602652 │ │, 2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean │ min │ median │ max │ nunique │
│ │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │
├─────┼──────────┼─────────┼──────────┼─────────┼──────────┼─────────┤
│ 1 │ a │ 3.0 │ 3 │ 3.0 │ 3 │ │
│ 2 │ b │ 0.5136 │ 0.280238 │ 0.5136 │ 0.746962 │ │])
I think it could omit the printing of g itself. and add some line breaks in between output (probably linebreaks and some info about the group.)
I think that printing info about the group might be tough though. The only info that a GroupedDataFrame stores is a hash and row numbers. So it would need some exploratory work done.
I think that printing info about the group might be tough though. The only info that a
GroupedDataFramestores is a hash and row numbers. So it would need some exploratory work done.
Yes, but you know that all values for grouping columns are by definition equal for a given group, so you can just take the first one.
The result of map is indeed nonsense, but that's a problem with GroupApplied, which should be removed.
Cool. Do you think you could bundle the printing PR with #1520? Or should we wait for that to be merged and see what to do.
fwiw, describe.(g) actually works quite well, other than being memory intensive. The only improvement we would conceivably make would be to have map be an iterator and print out the grouping better.
I'd rather keep #1520 self-contained, printing is totally orthogonal to it.
Sounds good when that is merged I will make a PR for printing of map and then we can tell users to use map for describe.
After #1520 and others, map doesn't have that awkward printing.
df = DataFrame(a = repeat(1:4, outer = 5), b = randn(20), c = randn(20) .+ 1)
g = groupby(df, :a)
julia> map(describe, g)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 3 rows
│ Row │ a │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Int64 │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
├─────┼───────┼──────────┼────────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1 │ 1 │ a │ 1.0 │ 1 │ 1.0 │ 1 │ │ │ Int64 │
│ 2 │ 1 │ b │ 0.00779658 │ -1.4079 │ 0.365603 │ 1.78441 │ │ │ Float64 │
│ 3 │ 1 │ c │ 0.702246 │ -0.164928 │ 0.783903 │ 1.67254 │ │ │ Float64 │
⋮
Last Group: 3 rows
│ Row │ a │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Int64 │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1 │ 4 │ a │ 4.0 │ 4 │ 4.0 │ 4 │ │ │ Int64 │
│ 2 │ 4 │ b │ 0.309454 │ -0.668419 │ 0.244776 │ 1.49806 │ │ │ Float64 │
│ 3 │ 4 │ c │ 0.949839 │ 0.286902 │ 1.19766 │ 1.52525 │ │ │ Float64 │
This is pretty good. Having the result be a grouped dataframe feels intuitive. However
- It isn't clear which sub-dataframe corresponds to which values in our grouping columns. Though this is a more general problem for grouped dataframes
- Default grouped dataframe printing only shows the first and last groups. This isn't very useful for looking at summary statistics.
We could probably print the values of the grouping columns next to the group number. We could also try to print as many groups as possible on screen, but that wouldn't be enough in most cases I guess.
Note that you can also do combine(describe, g) to get a single data frame.
#1632 Addresses the general grouped printing function.
showall(map(describe, g)) has the correct behavior. The questions is whether it's too verbose for recommended use.
Now that #1632 has been merged, I wonder if we should change the way grouped DataFrames are printed in general.
Should we default to showing 10 rows of however many groups we can? I'm not sure I see the benefit of showing the first and last groups only.
Ideally I think we should print the header only once, using a fixed width across groups for a given column. That would free a lot of space, and then it would make sense to print as many groups as possible.
That's what TexTables.jl does here with summarize_by.
I've been playing around with describe for a grouped data frame and this is the best one-liner
It's still not pretty, maybe when metadata is added we can have something that controls printing to make this better?
julia> df = DataFrame(a = rand(1:5, 1000), b = rand(1000));
julia> combine(describe, groupby(df, "a"), ungroup = false) |> (t -> show(t, allrows = true, allgroups = true))
GroupedDataFrame with 5 groups based on key: a
Group 1 (2 rows): a = 2
│ Row │ a │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Int64 │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼─────────┼──────────┼─────────┼──────────┼──────────┤
│ 1 │ 2 │ a │ 2.0 │ 2 │ 2.0 │ 2 │ │ │ Int64 │
│ 2 │ 2 │ b │ 0.511089 │ 0.0032423 │ 0.51914 │ 0.994888 │ │ │ Float64 │
Group 2 (2 rows): a = 3
│ Row │ a │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Int64 │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
├─────┼───────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1 │ 3 │ a │ 3.0 │ 3 │ 3.0 │ 3 │ │ │ Int64 │
│ 2 │ 3 │ b │ 0.508742 │ 0.00174766 │ 0.541555 │ 0.997982 │ │ │ Float64 │
Group 3 (2 rows): a = 1
│ Row │ a │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Int64 │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1 │ 1 │ a │ 1.0 │ 1 │ 1.0 │ 1 │ │ │ Int64 │
│ 2 │ 1 │ b │ 0.532574 │ 0.0107663 │ 0.560414 │ 0.994937 │ │ │ Float64 │
Group 4 (2 rows): a = 5
│ Row │ a │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Int64 │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
├─────┼───────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1 │ 5 │ a │ 5.0 │ 5 │ 5.0 │ 5 │ │ │ Int64 │
│ 2 │ 5 │ b │ 0.496365 │ 4.74044e-5 │ 0.495809 │ 0.990004 │ │ │ Float64 │
Group 5 (2 rows): a = 4
│ Row │ a │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Int64 │ Symbol │ Float64 │ Real │ Float64 │ Real │ Nothing │ Nothing │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1 │ 4 │ a │ 4.0 │ 4 │ 4.0 │ 4 │ │ │ Int64 │
│ 2 │ 4 │ b │ 0.506765 │ 0.0109599 │ 0.504049 │ 0.99936 │ │ │ Float64 │
We could improve GroupDataFrame printing to avoid repeating columns and use the same column width for all groups.
I made a novice attempt at writing a work-around a few months ago. It was heavily inspired on the qsu() function in the collapse library.
R Code
# Panel data statistics: overall, on group-means and group-centered data
qsu(iris, pid = Sepal.Length ~ Species, higher = TRUE)
## N/T Mean SD Min Max Skew Kurt
## Overall 150 5.8433 0.8281 4.3 7.9 0.3118 2.4264
## Between 3 5.8433 0.7951 5.006 6.588 -0.2112 1.5
## Within 50 5.8433 0.5113 4.1553 7.1553 0.1187 3.2633
julia Code
# Inspired from Collapse library's qsu() function in (R).
using DataFrames, StatsBase
function QSU(df::AbstractDataFrame, var::Symbol, group::Symbol)
gdf = groupby(df, group)
N = Int[]
Mode = Any[]
Median = Float64[]
Mean = Float64[]
SD = Float64[]
Min = Float64[]
Max = Float64[]
IQR = Float64[]
Kurt = Float64[]
for g in gdf
values = g[:, var]
push!(N, length(values))
push!(Mode, mode(values))
push!(Median, median(values))
push!(Mean, mean(values))
push!(SD, std(values))
push!(Min, minimum(values))
push!(Max, maximum(values))
push!(IQR, iqr(values))
push!(Kurt, kurtosis(values))
end
result = DataFrame(
group = [key[group] for key in keys(gdf)],
N = N,
Mode = Mode,
Median = Median,
Mean = Mean,
SD = SD,
Min = Min,
Max = Max,
IQR = IQR,
Kurt = Kurt)
return sort(result, :group)
end
Maybe this can be useful.
Edit:
Took some time to look into other methods and found combine() can achieve similar outputs -- although less efficient in benchmarking.
using RDatasets
iris = dataset("datasets", "iris")
gdf = groupby(iris, :Species)
@btime combine(gdf, nrow => :n, :PetalLength .=>
[mode, median, mean, std, minimum, maximum, iqr, kurtosis] .=>
["mode", "median", "mean", "std", "minimum", "maximum", "iqr", "kurt"])
# 114.250 μs (988 allocations: 53.80 KiB)
@btime QSU(iris, :SepalLength, :Species);
# 15.932 μs (299 allocations: 26.70 KiB)