StatsBase.jl
Unexpected behaviour of Histogram when using nbins
Hello, I would like to report the following issue. Suppose you want to produce a histogram of a data set defined between, say, xmin and xmax, with approximately nbins bins. There are two possible approaches.
- Compute the bin edges directly, e.g. bins = [xmin + i*dx for i in 0:nbins], where dx = (xmax - xmin) / nbins.
- Pass nbins directly to the fit function.
The two methods yield inconsistent results. For example:
xmin = -15.0
xmax = 15.0
nbins = 20
dx = (xmax - xmin) / nbins
bins = [xmin + i * dx for i in 0:nbins]
println(bins)
[-15.0, -13.5, -12.0, -10.5, -9.0, -7.5, -6.0, -4.5, -3.0, -1.5, 0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0, 10.5, 12.0, 13.5, 15.0]
h2 = fit(Histogram, xx, bins)
length(h2.weights)
20
where xx is the data. This method yields, as it should, a vector of weights whose length equals nbins.
If you try now:
h1 = fit(Histogram, xx, nbins=20)
length(h1.weights)
16
The number of bins is smaller than what the user asked for. Inspecting the histograms, one can see that the first is built with the edges:
Histogram{Int64, 1, Tuple{Vector{Float64}}}
edges:
  [-15.0, -13.5, -12.0, -10.5, -9.0, -7.5, -6.0, -4.5, -3.0, -1.5 … 1.5, 3.0, 4.5, 6.0, 7.5, 9.0, 10.5, 12.0, 13.5, 15.0]
weights: [1122, 1274, 1556, 1933, 2642, 3574, 4642, 5625, 6560, 7092, 7057, 6718, 5672, 4597, 3604, 2637, 2025, 1519, 1261, 1155]
closed: left
isdensity: false
while for the second, a range with an integer-valued step has been used:
Histogram{Int64, 1, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}}}}
edges:
  -16.0:2.0:16.0
weights: [718, 1678, 2184, 3031, 4490, 6377, 8215, 9327, 9390, 8334, 6320, 4575, 3075, 2135, 1654, 762]
closed: left
isdensity: false
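The edges -16.0:2.0:16.0 would be consistent with the raw bin width of 1.5 being rounded up to a "nice" step of 2.0 and the start being snapped down to a multiple of that step. Below is a minimal sketch of such a rounding scheme. The helper names nice_step and nice_edges are hypothetical, and the thresholds 1.1, 2.2 and 5.5 are assumptions made for illustration, not a statement of what StatsBase does internally.

```julia
# Hypothetical helper: round a raw bin width to 1, 2, 5 or 10 times a power of ten.
# The cut-offs (1.1, 2.2, 5.5) are illustrative assumptions.
function nice_step(lo::Real, hi::Real, nbins::Integer)
    raw = (hi - lo) / nbins                # width if exactly nbins bins were used
    base = exp10(floor(log10(raw)))        # largest power of ten <= raw
    r = raw / base                         # mantissa in [1, 10)
    factor = r <= 1.1 ? 1 : r <= 2.2 ? 2 : r <= 5.5 ? 5 : 10
    return factor * base
end

# Hypothetical helper: build edges that start on a multiple of the nice step
# and extend far enough to cover hi.
function nice_edges(lo::Real, hi::Real, nbins::Integer)
    step = nice_step(lo, hi, nbins)
    start = step * floor(lo / step)        # snap the first edge down to a multiple of step
    n = ceil(Int, (hi - start) / step)     # number of bins needed to reach hi
    return range(start; step=step, length=n + 1)
end

edges = nice_edges(-15.0, 15.0, 20)
println(edges)                # -16.0:2.0:16.0
println(length(edges) - 1)    # 16
```

With xmin = -15.0, xmax = 15.0 and nbins = 20 this sketch gives 16 bins of width 2.0, the same edges as h1 above, which suggests that nbins is treated as a hint for choosing a round step rather than as an exact count.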
Is there a reason for this? Computing the bins from the data so that both cases produce the same result should be straightforward, and would avoid the user getting a histogram whose number of bins can be quite different from what she expects.
Thanks a lot for the excellent work.
Duplicate of https://github.com/JuliaStats/StatsBase.jl/issues/410?