StatsBase.jl

[Documentation] Missing Documentation for the Algorithms behind Histogram Fit

Open RoyiAvital opened this issue 2 years ago • 4 comments

It seems that the StatsAPI.fit method can accept a number of bins and compute the edges internally from the data. Yet the documentation doesn't specify the algorithm used for that.

I suggest adding a description of the algorithm. NumPy and MATLAB document theirs and let the user choose the algorithm in that case.

By the way, it would be great to have an option to set closed to :both or :neither. It won't create a valid histogram, yet it is useful in many data analysis cases.

RoyiAvital, Aug 12 '22 15:08

Looking at the code, it seems that we use the Sturges formula to choose the number of bins: https://github.com/JuliaStats/StatsBase.jl/blob/bd4ca61f4bb75f2c6cd0a47aee1cfde7b696eb9c/src/hist.jl#L356-L357

Does that match what you see in NumPy and MATLAB (and R AFAICT)?
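For readers comparing against other libraries, here is a minimal sketch of the textbook Sturges formula, ceil(log2(n)) + 1. It is written in Python purely for illustration; the function name `sturges_bins` is hypothetical, and the linked StatsBase source may differ in details such as edge rounding or empty-input handling.

```python
import math


def sturges_bins(n: int) -> int:
    """Textbook Sturges formula: ceil(log2(n)) + 1 bins for n observations.

    Illustration only; not a copy of the StatsBase implementation.
    """
    if n < 1:
        # Assumption: reject empty input rather than guess a bin count.
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n)) + 1


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, sturges_bins(n))
```

Because the count grows with log2(n), doubling the sample size adds only about one bin.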

(The closed issue is separate, not sure whether it would be possible/good without a dedicated discussion.)

nalimilan, Aug 24 '22 07:08

I think the algorithm should be exposed, so that others can add algorithms to match their favorites.

It seems that discussions are not enabled in this repository. So where should the discussion on closed happen?

RoyiAvital, Aug 24 '22 07:08

Can you open a second, new issue describing the problems you encounter with closed? Thank you.

mschauer, Aug 24 '22 08:08

A PR to add support for more methods to compute the number of bins would be welcome. We could accept both symbols (for standard methods) and arbitrary functions taking the data as input and returning edges.
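One way such an interface could look is sketched below: a rule is either a name for a standard method or a callable that takes the data and returns edges. This is a language-neutral illustration in Python, not a proposed StatsBase API; `resolve_edges`, `NAMED_RULES`, `sturges_edges`, and `equal_width_edges` are all hypothetical names, and the equal-width edges from minimum to maximum are an assumption made to keep the example short.

```python
import math
from typing import Callable, Sequence, Union

Edges = list[float]
BinRule = Union[str, Callable[[Sequence[float]], Edges]]


def equal_width_edges(data: Sequence[float], nbins: int) -> Edges:
    """Return nbins + 1 equally spaced edges spanning min(data)..max(data)."""
    lo, hi = min(data), max(data)
    if lo == hi:
        # Assumption: widen a degenerate range so the bins have nonzero width.
        hi = lo + 1.0
    width = (hi - lo) / nbins
    # Append hi explicitly so the last edge equals the maximum exactly.
    return [lo + i * width for i in range(nbins)] + [hi]


def sturges_edges(data: Sequence[float]) -> Edges:
    """Edges using the textbook Sturges bin count, ceil(log2(n)) + 1."""
    nbins = math.ceil(math.log2(len(data))) + 1
    return equal_width_edges(data, nbins)


# Registry of named standard methods; more could be added here.
NAMED_RULES: dict[str, Callable[[Sequence[float]], Edges]] = {
    "sturges": sturges_edges,
}


def resolve_edges(data: Sequence[float], rule: BinRule = "sturges") -> Edges:
    """Dispatch on the rule: call it if callable, else look it up by name."""
    if callable(rule):
        return rule(data)
    if rule not in NAMED_RULES:
        raise ValueError(f"unknown bin rule: {rule!r}")
    return NAMED_RULES[rule](data)


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    print(resolve_edges(data))                                # named rule
    print(resolve_edges(data, lambda d: [min(d), max(d)]))    # custom rule
```

Having custom rules return edges rather than a bin count lets them produce unequal-width bins, which a count alone cannot express.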

For now I've just added details to document the current behavior at https://github.com/JuliaStats/StatsBase.jl/pull/829.

nalimilan, Aug 25 '22 19:08

Side note for the future: besides documenting this, there are better rules than the Sturges rule nowadays, which tends to suggest far too few bins. More modern rules suggest O(n^(1/3)) bins. I believe the default choice in R is the Sturges rule, which might be where this was taken from. ggplot2 uses 30 bins (!), regardless of sample size.
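To make the difference in growth rates concrete, the sketch below compares the textbook Sturges count with the Rice rule, ceil(2 * n^(1/3)). The Rice rule is chosen here only as one well-known example of an O(n^(1/3)) rule; the comment above does not name a specific rule, and neither function is taken from StatsBase.

```python
import math


def sturges_bins(n: int) -> int:
    """Textbook Sturges formula: grows like log2(n)."""
    return math.ceil(math.log2(n)) + 1


def rice_bins(n: int) -> int:
    """Rice rule: grows like n^(1/3). Used as an illustrative stand-in."""
    return math.ceil(2 * n ** (1 / 3))


if __name__ == "__main__":
    # Print both bin counts side by side for increasing sample sizes.
    for n in (100, 10_000, 1_000_000):
        print(f"n={n:>9,}  sturges={sturges_bins(n):>3}  rice={rice_bins(n):>4}")
```

The gap between the two counts widens as n grows, which is the behavior the comment describes.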

ParadaCarleton, Jun 21 '23 17:06