DataFrames.jl

describe(df, functions...) method

Open jtrakk opened this issue 3 years ago • 10 comments

Currently describe(df, :mean) is allowed but describe(df, mean) is not. Allowing the latter would let users provide their own describing functions.

jtrakk avatar Feb 21 '21 23:02 jtrakk

Passing custom functions to describe is already allowed. See the docstring:

stats::Union{Symbol, Pair}... : the summary statistics to report. Arguments can be:
• A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nmissing, and :eltype.
• :all as the only Symbol argument to return all statistics.
• A function => name pair where name is a Symbol or string. This will create a column of summary statistics with the provided name.

bkamins avatar Feb 22 '21 06:02 bkamins

Example:

  julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');

  julia> describe(df, :min, sum => :sum)
  3×3 DataFrame
   Row │ variable  min  sum
       │ Symbol    Any  Any
  ─────┼────────────────────
     1 │ i         1    55
     2 │ x         0.1  5.5
     3 │ y         a

bkamins avatar Feb 22 '21 06:02 bkamins

I want to pass a function directly, as in describe(df, mean), and have the column name mean inferred automatically.

jtrakk avatar Feb 22 '21 06:02 jtrakk

This can be considered in the future.

bkamins avatar Feb 22 '21 09:02 bkamins
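
Until such a method exists, the requested behavior can be approximated in user code. The following is a minimal sketch, not part of DataFrames.jl: describe_with is a hypothetical helper name, and it assumes every function passed in is a named function, so that nameof returns a usable column name (anonymous functions would get compiler-generated names).

```julia
using DataFrames
using Statistics

# Hypothetical helper (not part of DataFrames.jl): build `function => name`
# pairs, taking each name from the function itself via `nameof`.
describe_with(df, funs...) = describe(df, (f => nameof(f) for f in funs)...)

df = DataFrame(i=1:10, x=0.1:0.1:1.0)

# Intended to be equivalent to describe(df, mean => :mean, sum => :sum)
stats_df = describe_with(df, mean, sum)
```

The helper only forwards to the documented function => name form, so it does not change how describe computes anything.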

In fact, it might be worth considering dropping the describe(df, symbol) method. The disadvantage is breaking backward compatibility and making the user type using Statistics at the top. The advantage is not needing to special-case certain function symbols.

jtrakk avatar Feb 26 '21 23:02 jtrakk

Some allowed symbols do not have a corresponding one-argument function.

bkamins avatar Feb 26 '21 23:02 bkamins

For reference, the current list is :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, :nmissing. The ones without a corresponding function are :q25, :q75, :nunique, and :nmissing.

jtrakk avatar Feb 26 '21 23:02 jtrakk
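
To illustrate what one-argument stand-ins for those four symbols might look like, here is a small sketch using only Base and the Statistics standard library. The function names q25, q75, nunique, and nmissing are hypothetical definitions written for this example; they are an assumption about the intended meaning of each statistic, not the DataFrames.jl internals.

```julia
using Statistics

# Hypothetical one-argument stand-ins for the symbols that have no
# ready-made function. Illustrations only, not DataFrames.jl internals.
q25(x) = quantile(x, 0.25)          # first quartile
q75(x) = quantile(x, 0.75)          # third quartile
nunique(x) = length(unique(x))      # number of distinct values
nmissing(x) = count(ismissing, x)   # number of missing entries

v = collect(1:10)
quartiles = (q25(v), q75(v))
distinct = nunique(['a', 'b', 'a'])
gaps = nmissing([1, missing, 3, missing])
```

For v = 1:10 the quartiles come out as 3.25 and 7.75, the same values shown for column i in the describe(df, :all) output later in this thread.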

Also note that the other statistics are not computed with the bare function either: it is actually, e.g., mean∘skipmissing, not just mean.

bkamins avatar Feb 26 '21 23:02 bkamins
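
The difference is easy to see on a plain vector. This sketch uses only Base and the Statistics standard library; the vector v is made up for illustration.

```julia
using Statistics

v = [1, missing, 3]

# A bare `mean` propagates the missing value.
plain = mean(v)

# Composing with `skipmissing` drops missing entries before averaging.
skipped = (mean ∘ skipmissing)(v)
```

Here plain is missing, while skipped is 2.0.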

I am reading the documentation, and by default the describe(df) function now reports mean, median, min, max, and nmissing. Are you considering adding std or percentiles (25%, 75%), similar to how the pandas describe function works?

indymnv avatar Feb 21 '22 12:02 indymnv

No, because then the output is usually too wide to fit the standard screen size and gets cropped.

If you want to see more statistics and your screen is wide enough (or you are working in a Jupyter Notebook with proper output settings), then use the :detailed or :all argument:

julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');

julia> describe(df)
3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ i         5.5     1    5.5     10          0  Int64
   2 │ x         0.55    0.1  0.55    1.0         0  Float64
   3 │ y                 a            j           0  Char

julia> describe(df, :all)
3×13 DataFrame
 Row │ variable  mean    std       min  q25     median  q75     max  nunique  nmissing  first  last  eltype
     │ Symbol    Union…  Union…    Any  Union…  Union…  Union…  Any  Union…   Int64     Any    Any   DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ i         5.5     3.02765   1    3.25    5.5     7.75    10                   0  1      10    Int64
   2 │ x         0.55    0.302765  0.1  0.325   0.55    0.775   1.0                  0  0.1    1.0   Float64
   3 │ y                           a                            j    10              0  a      j     Char

julia> describe(df, :detailed)
3×11 DataFrame
 Row │ variable  mean    std       min  q25     median  q75     max  nunique  nmissing  eltype
     │ Symbol    Union…  Union…    Any  Union…  Union…  Union…  Any  Union…   Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────────────────────
   1 │ i         5.5     3.02765   1    3.25    5.5     7.75    10                   0  Int64
   2 │ x         0.55    0.302765  0.1  0.325   0.55    0.775   1.0                  0  Float64
   3 │ y                           a                            j    10              0  Char

(or just list the statistics you want to see)

bkamins avatar Feb 21 '22 12:02 bkamins
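
As a sketch of that last option, the statistics asked about above (std and the 25%/75% quantiles) can be requested by name alongside the mean. It reuses the same df as the examples in this thread; the variable name narrow is made up for illustration.

```julia
using DataFrames

df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')

# Request only the statistics of interest; the result has one row per
# column of df and one column per requested statistic, plus `variable`.
narrow = describe(df, :mean, :std, :q25, :q75)
```

This keeps the table to five columns, which should fit a standard terminal width more comfortably than the :all output.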