DataFrames.jl
describe(df, functions...) method
Currently describe(df, :mean) is allowed but not describe(df, mean), which would allow users to provide their own describing functions.
It is allowed to pass custom functions to describe. See the docstring:
stats::Union{Symbol, Pair}... : the summary statistics to report. Arguments can be:
• A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nmissing, and :eltype.
• :all as the only Symbol argument to return all statistics.
• A function => name pair where name is a Symbol or string. This will create a column of summary statistics with the provided name.
Example:
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');
julia> describe(df, :min, sum => :sum)
3×3 DataFrame
 Row │ variable  min  sum
     │ Symbol    Any  Any
─────┼────────────────────
   1 │ i         1    55
   2 │ x         0.1  5.5
   3 │ y         a
I want to pass a function, as in describe(df, mean), and have it infer the name mean automatically.
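A minimal sketch of the requested behavior, built on the function => name pairs that describe already accepts. The helper name describe_with is hypothetical and not part of DataFrames.jl. It assumes every argument is a named function, so that nameof returns a usable column name; an anonymous function would get an opaque generated name instead.

```julia
using DataFrames
using Statistics

# Hypothetical helper (not part of DataFrames.jl): turn each function into a
# `function => name` pair, using `nameof` to derive the column name.
describe_with(df::AbstractDataFrame, fs::Function...) =
    describe(df, (f => nameof(f) for f in fs)...)

df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')

# Equivalent to describe(df, mean => :mean, sum => :sum)
res = describe_with(df, mean, sum)
```

With this wrapper the user still has to load Statistics to bring mean into scope, which is the trade-off raised later in the thread.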
This can be considered in the future.
In fact, it might be worth considering dropping the describe(df, symbol) method. The disadvantage is breaking backward compatibility and making the user type using Statistics at the top. The advantage is not needing to special-case certain function symbols.
Some allowed symbols do not have a corresponding one-argument function.
For reference, the current list is :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, :nmissing. The ones without functions are q25, q75, nunique, nmissing.
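As a rough illustration of what dropping the symbols would cost, here is a sketch of user-defined stand-ins for three of the symbols that have no ready-made one-argument function. The names q25, q75 and n_unique are hypothetical helpers defined here, not DataFrames.jl API. There is no stand-in for nmissing: according to the docstring, custom functions receive columns that allow missing values wrapped in skipmissing, so they cannot see the missing entries.

```julia
using DataFrames
using Statistics

# Hypothetical one-argument helpers, defined only for this example.
q25(x) = quantile(x, 0.25)
q75(x) = quantile(x, 0.75)
n_unique(x) = length(unique(x))

df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')

# Pass each helper as a `function => name` pair.
res = describe(df, q25 => :q25, q75 => :q75, n_unique => :nunique)
```

Unlike the built-in :nunique, this n_unique is computed for every column, numeric ones included.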
Also note that for the other functions it is actually e.g. mean∘skipmissing, not just mean.
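The difference matters as soon as a column contains missing values. A small sketch, using a made-up single-column data frame:

```julia
using DataFrames
using Statistics

df = DataFrame(a=[1, 2, missing, 4])

# Plain `mean` propagates `missing`.
plain = mean(df.a)

# Composing with `skipmissing` averages only the non-missing values.
skipped = (mean ∘ skipmissing)(df.a)

# The built-in :mean statistic agrees with the composed version.
built_in = describe(df, :mean).mean[1]
```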
I am reading the documentation, and by default the describe(df) function now returns mean, median, min, max, and nmissing. Are you considering adding std or percentiles (25%, 75%), similar to how the pandas describe function works?
No, because then the output is usually too wide to fit the standard screen size and gets cropped.
If you want to see more statistics and your screen is wide enough (or you are working in a Jupyter Notebook with proper output settings), then use the :detailed or :all argument:
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');
julia> describe(df)
3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ i         5.5     1    5.5     10          0  Int64
   2 │ x         0.55    0.1  0.55    1.0         0  Float64
   3 │ y                 a            j           0  Char
julia> describe(df, :all)
3×13 DataFrame
 Row │ variable  mean    std       min  q25     median  q75     max  nunique  nmissing  first  last  eltype
     │ Symbol    Union…  Union…    Any  Union…  Union…  Union…  Any  Union…   Int64     Any    Any   DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ i         5.5     3.02765   1    3.25    5.5     7.75    10                  0  1      10    Int64
   2 │ x         0.55    0.302765  0.1  0.325   0.55    0.775   1.0                 0  0.1    1.0   Float64
   3 │ y                           a                            j    10              0  a      j     Char
julia> describe(df, :detailed)
3×11 DataFrame
 Row │ variable  mean    std       min  q25     median  q75     max  nunique  nmissing  eltype
     │ Symbol    Union…  Union…    Any  Union…  Union…  Union…  Any  Union…   Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────────────────────
   1 │ i         5.5     3.02765   1    3.25    5.5     7.75    10                  0  Int64
   2 │ x         0.55    0.302765  0.1  0.325   0.55    0.775   1.0                 0  Float64
   3 │ y                           a                            j    10              0  Char
(or just list the statistics you want to see)
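For instance, a pandas-like selection could look like the sketch below. The choice of statistics is only an example; the cols keyword, which restricts the summary to a subset of columns, is one further way to keep the output narrow.

```julia
using DataFrames

df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')

# Request only the statistics of interest, in the desired order.
res = describe(df, :mean, :std, :q25, :q75)

# Optionally restrict the summary to the numeric columns.
numeric = describe(df, :mean, :std; cols=[:i, :x])
```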