StatsBase.jl
StatsBase.jl copied to clipboard
fit nbis=2 produces results of nbins=3
When fitting a histogram to data with nbins=2 I get the results corresponding to 3 bins. Same happens if I pass 0.0:0.5:1.5 and closed=:left instead. I get 3 bins.
julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=2, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
-0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false
julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=3, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
-0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false
I guess this is a bug, otherwise how to avoid this issue?
I'm not familiar with the histogram code, but AFAICT that's expected, and I get the same thing with R. The number of bins is only a suggestion, and it's not always possible to achieve the requested number. Maybe the documentation needs to be more explicit (it already says "approximate number of bins").
Ok, I do not how R handles the issue.
Then, to specify a trivial 2 bin histogram, Is one forced to do something like the following?
h = fit(Histogram, [0,1,0,0,1,1,1,1], -0.02:0.51:1.0, closed=:right)
Why can the function not resolve nbins=2 in this case to any valid edges (the above just one out of an infinite)?
Just to revive an old issue or "feature": Having a function, guessing how many bins to chose if one do not provide anything, is useful. BUT if I do provide the number of bins, I want to be sure that I exactly get that number of bins in my histogram. What is the point of not having this behaviour?
I think this can lead to unnecessary grief for the users: imagine two parties trying to sync bin counts... cc @oschulz for perspective
I think we can make nbins=X
strictly enforced since in real life people are probably counting on nbins
being very closely realized anyways.
@Moelf , could you be a bit pro precise on your example? I dont understand how two parties are involved in plotting a histogram. Could give an example?
it's common in science/data science people who processed data independently want to verify the result are the same (sync).
In this case, a naive understanding of nbins
would lead to n
bins (for example, if someone compares to numpy.histogram
)
@Moelf , yes sure! I misunderstood your comment. You suggest nbins
specifying the exact number of bins.
I think we can make nbins=X strictly enforced
I fully agree: if the user specifies nbins = 2
, then they should get two bins. Getting 3 bins is surprising to the user, and a histogram with 2 bins (and even one bin with nbins = 1
) is a perfectly valid thing.
At the very least it would be interesting to start providing an argument to force using exactly the requested number of bins. That wouldn't be breaking. Then in the next breaking release we could change the default if we're happy with how the new argument works.
It seems to be convention to make 'nice' numbers for bin widths. I like the suggestion by @nalimilan to add a new argument like nbins_exact
. This could be implemented to match functionality between
h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=29))
and
h = fit(Histogram, x, nbins_exact=28)
Hm, having two competing nbins
kwargs makes things ambiguous, though. Maybe some kind of annotation like nbins = (100, :exact)
or so instead?
I believe the change is not breaking, reason: we said it's approximate anyways, which means
- we can change the heuristics
- one of the option is to use exact numbers as is :)
I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact
. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram
in a proper way (with types), we just never got around to it. Maybe it's really time now.
I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:
hist1 / hist2
# thread-safe
push!(hist, val, weight)
It would be really nice to support the thread-safe push!
directly in StatsBase.Histogram
Yes, but I think the idea is this is "Base" histogram. Well, I'm happy to be part of StatsBase for sure since I'm already doing (<: AbstractHistogram
) but I don't know what the devs here think.
More advanced usage can maybe go in a new issue ... I just want my bin counts 🙃
See also https://github.com/JuliaStats/StatsBase.jl/pull/533.
I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:
hist1 / hist2 # thread-safe push!(hist, val, weight)
Who said this? Actually at https://github.com/JuliaStats/StatsBase.jl/issues/650 I said almost the contrary. Improvements to histograms in StatsBase would be welcome (though I admit finding reviewers isn't always easy).
Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)
You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).
If the number of bins is approximated it would be good to document how it is approximated. Even better would be the possibility to select the approximation algorithm. This could then also have the option to get the exact number of bins.
A few options I found are:
- exact
- sturges
- rice
- scott
Yes, that's the kind of methods we have in Plot now and that we should move over to StatsBase (with algorithm structs, not keywords like in Plots).
Gentle reminder if schedules permit:
@oschulz ...
I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram in a proper way (with types), we just never got around to it. Maybe it's really time now.
@mkborregaard ...
Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)
@oschulz ...
You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).
Thanks for the reminder! It's been on the back of my mind, I'll try to get on it soonish.
This code works now: h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=nbins)).
Hi @huixinzhang, I think the core of the discussion centers on the default response to using the nbins
keyword argument. The workaround using range()
is useful, but I believe nbins
should use the exact number of bins asked for if the code is to speak for itself!