DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Make `describe` work for a grouped dataframe

Open pdeffebach opened this issue 7 years ago • 26 comments

Reporting this to remind myself to make this PR.

Since GroupedDataFraemes seem to support broadcasting, this should be easy.

pdeffebach avatar Jul 08 '18 03:07 pdeffebach

df = DataFrame(rand(5,5))
df[:gender] = ["m", "m", "m", "f", "f"]
gd = groupby(df, :gender)

I don't get how broadcasting works with grouped data frames. I was under the impression that describe would just work with a broadcast.

println.(gd) # works as expected
describe.(gd)
> ERROR: MethodError: no method matching describe(::DataFrames.GroupedDataFrame)

but describe(gd[1]) works, since gd[1] is a subdataframe.

pdeffebach avatar Jul 08 '18 03:07 pdeffebach

Using by is a bit weird too

> by(df, :gender, describe)
│ Row │ gender │ variable │ mean     │ min       │ median    │ max      │ nunique │ nmissing │ eltype  │
├─────┼────────┼──────────┼──────────┼───────────┼───────────┼──────────┼─────────┼──────────┼─────────┤
│ 1   │ m      │ x1       │ 0.334892 │ 0.0531118 │ 0.0600073 │ 0.891557 │ nothing │ nothing  │ Float64 │
│ 2   │ m      │ x2       │ 0.484557 │ 0.351354  │ 0.424645  │ 0.677673 │ nothing │ nothing  │ Float64 │
│ 3   │ m      │ x3       │ 0.561228 │ 0.0260729 │ 0.775023  │ 0.882588 │ nothing │ nothing  │ Float64 │
│ 4   │ m      │ x4       │ 0.41416  │ 0.0112837 │ 0.605188  │ 0.626008 │ nothing │ nothing  │ Float64 │
│ 5   │ m      │ x5       │ 0.596377 │ 0.454716  │ 0.573004  │ 0.761409 │ nothing │ nothing  │ Float64 │
│ 6   │ m      │ gender   │ nothing  │ m         │ nothing   │ m        │ 1       │ nothing  │ String  │
│ 7   │ f      │ x1       │ 0.328554 │ 0.266763  │ 0.328554  │ 0.390345 │ nothing │ nothing  │ Float64 │
│ 8   │ f      │ x2       │ 0.712208 │ 0.524521  │ 0.712208  │ 0.899895 │ nothing │ nothing  │ Float64 │
│ 9   │ f      │ x3       │ 0.305259 │ 0.253505  │ 0.305259  │ 0.357012 │ nothing │ nothing  │ Float64 │
│ 10  │ f      │ x4       │ 0.614933 │ 0.381118  │ 0.614933  │ 0.848749 │ nothing │ nothing  │ Float64 │
│ 11  │ f      │ x5       │ 0.206904 │ 0.145303  │ 0.206904  │ 0.268505 │ nothing │ nothing  │ Float64 │
│ 12  │ f      │ gender   │ nothing  │ f         │ nothing   │ f        │ 1       │ nothing  │ String  │

pdeffebach avatar Jul 08 '18 04:07 pdeffebach

map(describe, df) works fine, but for some reason it does some extra printing of each subdataframe. not sure why it does this.

pdeffebach avatar Jul 08 '18 04:07 pdeffebach

That's because map returns a GroupApplied object, which can then be combined back to a single data frame. The result of by(df, :gender, describe) is indeed a bit weird, because the grouping variable is passed to the function. I think there have been repeated complaints about this, maybe it should be optional. Not sure which default is best, we should compare with dplyr and Pandas.

nalimilan avatar Jul 09 '18 20:07 nalimilan

R

The only solution I could find for R is to use a do command. Maybe this is better with purrr but I've actually never taken the time to learn purrr.

Simply calling summarize on a grouped dataframe doesn't alter behavior.

t = df %>% group_by(Species) %>% do(summary = summary(.))
> Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 x 2
  Species    summary    
* <fct>      <list>     
1 setosa     <S3: table>
2 versicolor <S3: table>
3 virginica  <S3: table>

t$summary
> [[1]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  

[[2]]
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  

[[3]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500      

Man R loves lists.

Pandas

Pandas is weird. iris.describe() looks nice:

>>> iris.describe()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

But when you group it, the output becomes unreadable. It prints each variable horizontally.

 t = iris.groupby('Species')
>>> t.describe()
           Petal.Length                                              Petal.Width        ...  Sepal.Length      Sepal.Width
                  count   mean       std  min  25%   50%    75%  max       count   mean ...           75%  max       count   mean       std  min    25%  50%    75%  max
Species                                                                                 ...
setosa             50.0  1.462  0.173664  1.0  1.4  1.50  1.575  1.9        50.0  0.246 ...           5.2  5.8        50.0  3.428  0.379064  2.3  3.200  3.4  3.675  4.4
versicolor         50.0  4.260  0.469911  3.0  4.0  4.35  4.600  5.1        50.0  1.326 ...           6.3  7.0        50.0  2.770  0.313798  2.0  2.525  2.8  3.000  3.4
virginica          50.0  5.552  0.551895  4.5  5.1  5.55  5.875  6.9        50.0  2.026 ...           6.9  7.9        50.0  2.974  0.322497  2.2  2.800  3.0  3.175  3.8

Thoughts

Returning a grouped dataframe:

│ Row │ gender |variable │ mean     │
├─────┼────────|─────────┼──────────┤
│ 1   │ M      |x1       │ 0.456352 │
│ 2   │ M      |x2       │ 0.441564 │
│ 3   │ M      |x3       │ 0.448726 │
│ 4   │ M      |x4       │ 0.456411 │
│ 5   │ M      |x5       │ 0.522709 │
│ 6   │ M      |sex      │     ?    │

│ Row │ gender |variable │ mean     │
├─────┼────────|─────────┼──────────┤
│ 1   │ F      |x1       │ 0.456352 │
│ 2   │ F      |x2       │ 0.441564 │
│ 3   │ F      |x3       │ 0.448726 │
│ 4   │ F      |x4       │ 0.456411 │
│ 5   │ F      |x5       │ 0.522709 │
│ 6   │ F      |sex      │     ?    │

This is weird because the :gender column's value isn't a property of the column :x1 so grouping the summary statistics doesn't really make sense.

We could just print 3 DataFrames with the annotation where :gender == "M" or something. But then we have different outputs for describe since describe returns a DataFrame while this new behavior would return Nothing.

pdeffebach avatar Jul 13 '18 17:07 pdeffebach

Thanks for checking! dplyr is indeed a bit different from us, since the grouping information can be present without necessarily changing the behavior of many operations. Pandas' behavior makes sense, but the printing is terrible.

How about returning a GroupedDataFrame with the same columns as the standard describe, plus the grouping column? Basically something similar to what by(df, :gender, describe) currently returns, but without the weird additional row for the grouping column.

nalimilan avatar Jul 13 '18 17:07 nalimilan

This is a good idea. And something I will work to implement.

I think that there are some changes I want to make before this, since grouped operations can be expensive, see my comment in #1256.

Without trying to re-optimize grouped operations just for this, it's worth changing summarize so that it only does the calculations it needs to, rather than calculating everything and just the statistics specified in stats.

pdeffebach avatar Jul 27 '18 14:07 pdeffebach

I'm going to implement this:

function describe(g::GroupedDataFrame; kwargs...)
       des = combine(map(d -> describe(d, kwargs...), g)) # a dataframe
       groupby(des, g.cols) # group it again
end

Though it isn't much use if Grouped DataFrames don't print more information. They currently cutoff pretty quickly. However I am under the impression we will be changing grouped dataframes printing in the future.

This behavior also might change with #1520

pdeffebach avatar Sep 26 '18 19:09 pdeffebach

Thanks. Actually, I wonder whether we shouldn't make this the standard behavior of map on GroupedDataFrame. At #1520 I suggested replacing combine with map, but it would also make sense to keep them as different operations, with map preserving the grouping information. Then instead of describe(gd), one would call map(describe, gd).

nalimilan avatar Sep 27 '18 10:09 nalimilan

I think thats the right move for sure. Two things.

  1. Does 1520 change the awkward printing of each subdataframe?
  2. Would there be any way to print something like
Group 1: :country == "United States" & :sex == "Female"

as the default behavior for any map operation?

pdeffebach avatar Sep 27 '18 12:09 pdeffebach

  1. No, not at all. But can you develop what you find awkward in particular?
  2. Yes, that sounds like a good idea. Feel free to experiment with that (the text should probably be cut if it's too long to fit on screen).

nalimilan avatar Sep 27 '18 12:09 nalimilan

Regarding this issue, I'll update #1520 to keep combine, and we can come back to it after that.

nalimilan avatar Sep 27 '18 12:09 nalimilan

This is the current output of map(describe, g::GroupedDataFrame)

julia> map(describe, g)
GroupApplied{DataFrame}(GroupedDataFrame with 3 groups based on keys: :A, :B
First Group: 2 rows
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.121894 │
│ 2   │ 1     │ 0.427624 │
⋮
Last Group: 2 rows
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 3     │ 0.280238 │
│ 2   │ 3     │ 0.746962 │, DataFrame[2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean     │ min      │ median   │ max      │ nunique │
│     │ Symbol   │ Float64  │ Real     │ Float64  │ Real     │ Nothing │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┤
│ 1   │ a        │ 1.0      │ 1        │ 1.0      │ 1        │         │
│ 2   │ b        │ 0.274759 │ 0.121894 │ 0.274759 │ 0.427624 │         │, 2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean     │ min      │ median   │ max      │ nunique │
│     │ Symbol   │ Float64  │ Real     │ Float64  │ Real     │ Nothing │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┤
│ 1   │ a        │ 2.0      │ 2        │ 2.0      │ 2        │         │
│ 2   │ b        │ 0.401019 │ 0.199386 │ 0.401019 │ 0.602652 │         │, 2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean    │ min      │ median  │ max      │ nunique │
│     │ Symbol   │ Float64 │ Real     │ Float64 │ Real     │ Nothing │
├─────┼──────────┼─────────┼──────────┼─────────┼──────────┼─────────┤
│ 1   │ a        │ 3.0     │ 3        │ 3.0     │ 3        │         │
│ 2   │ b        │ 0.5136  │ 0.280238 │ 0.5136  │ 0.746962 │         │])

I think it could omit the printing of g itself. and add some line breaks in between output (probably linebreaks and some info about the group.)

I think that printing info about the group might be tough though. The only info that a GroupedDataFrame stores is a hash and row numbers. So it would need some exploratory work done.

pdeffebach avatar Sep 27 '18 12:09 pdeffebach

I think that printing info about the group might be tough though. The only info that a GroupedDataFrame stores is a hash and row numbers. So it would need some exploratory work done.

Yes, but you know that all values for grouping columns are by definition equal for a given group, so you can just take the first one.

The result of map is indeed nonsense, but that's a problem with GroupApplied, which should be removed.

nalimilan avatar Sep 27 '18 13:09 nalimilan

Cool. Do you think you could bundle the printing PR with #1520? Or should we wait for that to be merged and see what to do.

fwiw, describe.(g) actually works quite well, other than being memory intensive. The only improvement we would conceivably make would be to have map be an iterator and print out the grouping better.

pdeffebach avatar Sep 27 '18 15:09 pdeffebach

I'd rather keep #1520 self-contained, printing is totally orthogonal to it.

nalimilan avatar Sep 27 '18 15:09 nalimilan

Sounds good when that is merged I will make a PR for printing of map and then we can tell users to use map for describe.

pdeffebach avatar Sep 27 '18 15:09 pdeffebach

After #1520 and others, map doesn't have that awkward printing.

df = DataFrame(a = repeat(1:4, outer = 5), b = randn(20), c = randn(20) .+ 1)
g = groupby(df, :a)
julia> map(describe, g)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 3 rows
│ Row │ a     │ variable │ mean       │ min       │ median   │ max     │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64    │ Real      │ Float64  │ Real    │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼────────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ 1     │ a        │ 1.0        │ 1         │ 1.0      │ 1       │         │          │ Int64    │
│ 2   │ 1     │ b        │ 0.00779658 │ -1.4079   │ 0.365603 │ 1.78441 │         │          │ Float64  │
│ 3   │ 1     │ c        │ 0.702246   │ -0.164928 │ 0.783903 │ 1.67254 │         │          │ Float64  │
⋮
Last Group: 3 rows
│ Row │ a     │ variable │ mean     │ min       │ median   │ max     │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64  │ Real    │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ 4     │ a        │ 4.0      │ 4         │ 4.0      │ 4       │         │          │ Int64    │
│ 2   │ 4     │ b        │ 0.309454 │ -0.668419 │ 0.244776 │ 1.49806 │         │          │ Float64  │
│ 3   │ 4     │ c        │ 0.949839 │ 0.286902  │ 1.19766  │ 1.52525 │         │          │ Float64  │

This is pretty good. Having the result be a grouped dataframe feels intuitive. However

  1. It isn't clear which sub-dataframe corresponds to which values in our grouping columns. Though this is a more general problem for grouped dataframes
  2. Default grouped dataframe printing only shows the first and last groups. This isn't very useful for looking at summary statistics.

pdeffebach avatar Dec 12 '18 21:12 pdeffebach

We could probably print the values of the grouping columns next to the group number. We could also try to print as many groups as possible on screen, but that wouldn't be enough in most cases I guess.

Note that you can also do combine(describe, g) to get a single data frame.

nalimilan avatar Dec 12 '18 21:12 nalimilan

#1632 Addresses the general grouped printing function.

showall(map(describe, g)) has the correct behavior. The questions is whether it's too verbose for recommended use.

pdeffebach avatar Dec 12 '18 21:12 pdeffebach

Now that #1632 has been merged, I wonder if we should change the way grouped DataFrames are printed in general.

Should we default to showing 10 rows of however many groups we can? I'm not sure I see the benefit of showing the first and last groups only.

pdeffebach avatar Jan 03 '19 18:01 pdeffebach

Ideally I think we should print the header only once, using a fixed width across groups for a given column. That would free a lot of space, and then it would make sense to print as many groups as possible.

nalimilan avatar Jan 03 '19 22:01 nalimilan

That's what TexTables.jl does here with summarize_by.

pdeffebach avatar Jan 09 '19 16:01 pdeffebach

I've been playing around with describe for a grouped data frame and this is the best one-liner

It's still not pretty, maybe when metadata is added we can have something that controls printing to make this better?

julia> df = DataFrame(a = rand(1:5, 1000), b = rand(1000));

julia> combine(describe, groupby(df, "a"), ungroup = false) |> (t -> show(t, allrows = true, allgroups = true))
GroupedDataFrame with 5 groups based on key: a
Group 1 (2 rows): a = 2
│ Row │ a     │ variable │ mean     │ min       │ median  │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64 │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼─────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 2     │ a        │ 2.0      │ 2         │ 2.0     │ 2        │         │          │ Int64    │
│ 2   │ 2     │ b        │ 0.511089 │ 0.0032423 │ 0.51914 │ 0.994888 │         │          │ Float64  │
Group 2 (2 rows): a = 3
│ Row │ a     │ variable │ mean     │ min        │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real       │ Float64  │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 3     │ a        │ 3.0      │ 3          │ 3.0      │ 3        │         │          │ Int64    │
│ 2   │ 3     │ b        │ 0.508742 │ 0.00174766 │ 0.541555 │ 0.997982 │         │          │ Float64  │
Group 3 (2 rows): a = 1
│ Row │ a     │ variable │ mean     │ min       │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64  │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 1     │ a        │ 1.0      │ 1         │ 1.0      │ 1        │         │          │ Int64    │
│ 2   │ 1     │ b        │ 0.532574 │ 0.0107663 │ 0.560414 │ 0.994937 │         │          │ Float64  │
Group 4 (2 rows): a = 5
│ Row │ a     │ variable │ mean     │ min        │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real       │ Float64  │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 5     │ a        │ 5.0      │ 5          │ 5.0      │ 5        │         │          │ Int64    │
│ 2   │ 5     │ b        │ 0.496365 │ 4.74044e-5 │ 0.495809 │ 0.990004 │         │          │ Float64  │
Group 5 (2 rows): a = 4
│ Row │ a     │ variable │ mean     │ min       │ median   │ max     │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64  │ Real    │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ 4     │ a        │ 4.0      │ 4         │ 4.0      │ 4       │         │          │ Int64    │
│ 2   │ 4     │ b        │ 0.506765 │ 0.0109599 │ 0.504049 │ 0.99936 │         │          │ Float64  │

pdeffebach avatar Jun 19 '20 15:06 pdeffebach

We could improve GroupDataFrame printing to avoid repeating columns and use the same column width for all groups.

nalimilan avatar Jun 19 '20 20:06 nalimilan

I made a novice attempt at writing a work-around a few months ago. It was heavily inspired on the qsu() function in the collapse library.

R Code

# Panel data statistics: overall, on group-means and group-centered data
qsu(iris, pid = Sepal.Length ~ Species, higher = TRUE)
##          N/T    Mean      SD     Min     Max     Skew    Kurt
## Overall  150  5.8433  0.8281     4.3     7.9   0.3118  2.4264
## Between    3  5.8433  0.7951   5.006   6.588  -0.2112     1.5
## Within    50  5.8433  0.5113  4.1553  7.1553   0.1187  3.2633

julia Code

# Inspired from Collapse library's qsu() function in (R).
using DataFrames, StatsBase

function QSU(df::AbstractDataFrame, var::Symbol, group::Symbol)
    gdf = groupby(df, group)

    N = Int[]
    Mode = Any[]
    Median = Float64[]
    Mean = Float64[]
    SD = Float64[]
    Min = Float64[]
    Max = Float64[]
    IQR = Float64[]
    Kurt = Float64[]

    for g in gdf
        values = g[:, var]
        push!(N, length(values))
        push!(Mode, mode(values))
        push!(Median, median(values))
        push!(Mean, mean(values))
        push!(SD, std(values))
        push!(Min, minimum(values))
        push!(Max, maximum(values))
        push!(IQR, iqr(values))
        push!(Kurt, kurtosis(values))
    end

    result = DataFrame(
        group = [key[group] for key in keys(gdf)],
        N = N,
        Mode = Mode,
        Median = Median,
        Mean = Mean,
        SD = SD,
        Min = Min,
        Max = Max,
        IQR = IQR,
        Kurt = Kurt)

    return sort(result, :group)
end

Maybe this can be useful.

Edit:

Took some time to look into other methods and found combine() can achieve similar outputs -- although less efficient in benchmarking.

using RDatasets
iris = dataset("datasets", "iris")
gdf = groupby(iris, :Species)

@btime combine(gdf, nrow => :n, :PetalLength .=>
    [mode, median, mean, std, minimum, maximum, iqr, kurtosis] .=>
    ["mode", "median", "mean", "std", "minimum", "maximum", "iqr", "kurt"])
# 114.250 μs (988 allocations: 53.80 KiB)

@btime QSU(iris, :SepalLength, :Species);
# 15.932 μs (299 allocations: 26.70 KiB)

AuRobinson avatar Aug 30 '25 19:08 AuRobinson