StatsBase.jl
[Documentation] Missing Documentation for the Algorithms behind Histogram Fit
It seems that the `StatsAPI.fit` method can accept a number of bins and compute the edges internally from the data.
Yet the documentation doesn't specify the algorithm used for that.
I suggest adding a description of the algorithm. NumPy and MATLAB do document theirs and let the user choose the algorithm in that case.
By the way, it would be great to have an option to specify `closed` with the options `:both` and `:neither`. It won't create a valid histogram, yet it is useful in many cases of data analysis.
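To make the `closed` request concrete, here is a minimal sketch in plain Julia (no StatsBase) of what the four closure conventions would count. The helper `count_bins` is hypothetical, as are the `:both` and `:neither` values; none of this is existing StatsBase API.

```julia
# Hypothetical helper, not part of StatsBase: count values per bin
# under four interval-closure conventions.
function count_bins(data, edges; closed::Symbol = :left)
    inbin = if closed === :left
        (x, lo, hi) -> lo <= x < hi      # [lo, hi)
    elseif closed === :right
        (x, lo, hi) -> lo < x <= hi      # (lo, hi]
    elseif closed === :both
        (x, lo, hi) -> lo <= x <= hi     # [lo, hi]
    elseif closed === :neither
        (x, lo, hi) -> lo < x < hi       # (lo, hi)
    else
        throw(ArgumentError("unknown closed = $closed"))
    end
    # One count per pair of consecutive edges.
    return [count(x -> inbin(x, edges[i], edges[i + 1]), data)
            for i in 1:length(edges) - 1]
end

data = [1, 2, 2, 3]
edges = [1, 2, 3]
for c in (:left, :right, :both, :neither)
    println(c, " => ", count_bins(data, edges; closed = c))
end
```

With `:both`, a value sitting exactly on an interior edge is counted in two bins, and with `:neither` it is counted in none, which is why the counts no longer sum to the sample size.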
Looking at the code, it seems that we use the Sturges formula to choose the number of bins: https://github.com/JuliaStats/StatsBase.jl/blob/bd4ca61f4bb75f2c6cd0a47aee1cfde7b696eb9c/src/hist.jl#L356-L357
Does that match what you see in NumPy and MATLAB (and R AFAICT)?
(The `closed` issue is separate; I'm not sure whether it would be possible or a good idea without a dedicated discussion.)
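For reference, the textbook Sturges formula is k = ⌈log2(n)⌉ + 1 bins for n observations. The sketch below only writes that formula out; the name `sturges_bins` and the `n == 0` guard are assumptions made for illustration, and the linked `hist.jl` lines remain the authoritative version. How the bin count is then turned into edges is not shown here.

```julia
# Textbook Sturges formula, written out for illustration only.
# `sturges_bins` is a hypothetical name; the n == 0 guard is an
# assumption to avoid calling log2(0).
sturges_bins(n::Integer) = n == 0 ? 1 : ceil(Int, log2(n)) + 1

for n in (10, 100, 1_000, 1_000_000)
    println("n = ", n, " => ", sturges_bins(n), " bins")
end
```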
I think the algorithm should be exposed and then let others add algorithms to match their favorites.
It seems that discussions are not enabled in this repository. So where should the discussion on `closed` happen?
Can you make a second, new issue where you describe the problems you encounter with `closed`? Thank you.
A PR to add support for more methods to compute the number of bins would be welcome. We could accept both symbols (for standard methods) and arbitrary functions taking the data as input and returning edges.
For now I've just added details to document the current behavior at https://github.com/JuliaStats/StatsBase.jl/pull/829.
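As a rough sketch of what such an interface could look like (symbols for standard methods, arbitrary functions for everything else), something along these lines might work. All names here (`bin_edges`, `sturges_count`, `rice_count`, `equal_width_edges`) are hypothetical and not part of StatsBase, and the equal-width edges are a simplifying assumption.

```julia
# Hypothetical sketch, not StatsBase API: pick bin edges from either a
# named rule (Symbol) or a user-supplied function of the data.
sturges_count(n) = ceil(Int, log2(n)) + 1
rice_count(n) = ceil(Int, 2 * cbrt(n))

# Equal-width edges spanning the data range with `k` bins
# (assumes the data are not all identical).
equal_width_edges(data, k) = range(minimum(data), maximum(data); length = k + 1)

function bin_edges(data, rule::Symbol)
    n = length(data)
    k = rule === :sturges ? sturges_count(n) :
        rule === :rice    ? rice_count(n) :
        throw(ArgumentError("unknown rule: $rule"))
    return equal_width_edges(data, k)
end

# Any other callable is assumed to map the data directly to edges.
bin_edges(data, rule) = rule(data)

data = collect(1.0:100.0)
println(length(bin_edges(data, :sturges)) - 1)     # number of bins from a named rule
println(bin_edges(data, d -> [0.0, 50.0, 100.0]))  # user-supplied edges
```

Dispatching on `Symbol` versus a generic callable keeps the standard rules discoverable while leaving room for user-defined ones without touching the package.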
Side note for the future: besides documenting this, there are better rules nowadays than the Sturges rule, which tends to suggest far too few bins. More modern rules suggest O(n^(1/3)) bins. I believe the default choice in R is the Sturges rule, which might be where this was taken from. ggplot2 uses 30 bins (!), regardless of sample size.
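To illustrate the gap in growth rates, the sketch below compares the Sturges formula with the Rice rule, k = ⌈2 n^(1/3)⌉. The Rice rule is used here only as one example of an O(n^(1/3)) rule, which is an assumption on my part since no specific rule is named above; the function names are hypothetical.

```julia
# Compare Sturges' rule (grows like log2(n)) with the Rice rule
# (grows like n^(1/3)). Both are textbook formulas, written out here
# for illustration only; the names are hypothetical.
sturges_k(n) = ceil(Int, log2(n)) + 1
rice_k(n) = ceil(Int, 2 * cbrt(n))

for n in (100, 10_000, 1_000_000)
    println("n = ", n, ": Sturges = ", sturges_k(n), ", Rice = ", rice_k(n))
end
```

The two rules are close for small samples, but the cube-root rule suggests several times as many bins once n reaches the millions.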