StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

fit nbis=2 produces results of nbins=3

Open kakila opened this issue 6 years ago • 26 comments

When fitting a histogram to data with nbins=2 I get the results corresponding to 3 bins. Same happens if I pass 0.0:0.5:1.5 and closed=:left instead. I get 3 bins.

julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=2, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false

julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=3, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false

I guess this is a bug, otherwise how to avoid this issue?

kakila avatar Sep 08 '18 01:09 kakila

I'm not familiar with the histogram code, but AFAICT that's expected, and I get the same thing with R. The number of bins is only a suggestion, and it's not always possible to achieve the requested number. Maybe the documentation needs to be more explicit (it already says "approximate number of bins").

nalimilan avatar Sep 11 '18 12:09 nalimilan

Ok, I do not how R handles the issue.

Then, to specify a trivial 2 bin histogram, Is one forced to do something like the following?

h = fit(Histogram, [0,1,0,0,1,1,1,1], -0.02:0.51:1.0, closed=:right)

Why can the function not resolve nbins=2 in this case to any valid edges (the above just one out of an infinite)?

kakila avatar Sep 11 '18 21:09 kakila

Just to revive an old issue or "feature": Having a function, guessing how many bins to chose if one do not provide anything, is useful. BUT if I do provide the number of bins, I want to be sure that I exactly get that number of bins in my histogram. What is the point of not having this behaviour?

jamblejoe avatar Jul 29 '19 17:07 jamblejoe

I think this can lead to unnecessary grief for the users: imagine two parties trying to sync bin counts... cc @oschulz for perspective

I think we can make nbins=X strictly enforced since in real life people are probably counting on nbins being very closely realized anyways.

Moelf avatar Mar 03 '21 01:03 Moelf

@Moelf , could you be a bit pro precise on your example? I dont understand how two parties are involved in plotting a histogram. Could give an example?

jamblejoe avatar Mar 03 '21 07:03 jamblejoe

it's common in science/data science people who processed data independently want to verify the result are the same (sync).

In this case, a naive understanding of nbins would lead to n bins (for example, if someone compares to numpy.histogram)

Moelf avatar Mar 03 '21 07:03 Moelf

@Moelf , yes sure! I misunderstood your comment. You suggest nbins specifying the exact number of bins.

jamblejoe avatar Mar 03 '21 07:03 jamblejoe

I think we can make nbins=X strictly enforced

I fully agree: if the user specifies nbins = 2, then they should get two bins. Getting 3 bins is surprising to the user, and a histogram with 2 bins (and even one bin with nbins = 1) is a perfectly valid thing.

oschulz avatar Mar 03 '21 08:03 oschulz

At the very least it would be interesting to start providing an argument to force using exactly the requested number of bins. That wouldn't be breaking. Then in the next breaking release we could change the default if we're happy with how the new argument works.

nalimilan avatar Mar 03 '21 09:03 nalimilan

It seems to be convention to make 'nice' numbers for bin widths. I like the suggestion by @nalimilan to add a new argument like nbins_exact. This could be implemented to match functionality between

h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=29)) and h = fit(Histogram, x, nbins_exact=28)

thompsonmj avatar Jul 28 '21 18:07 thompsonmj

Hm, having two competing nbins kwargs makes things ambiguous, though. Maybe some kind of annotation like nbins = (100, :exact) or so instead?

oschulz avatar Jul 28 '21 19:07 oschulz

I believe the change is not breaking, reason: we said it's approximate anyways, which means

  • we can change the heuristics
  • one of the option is to use exact numbers as is :)

Moelf avatar Jul 28 '21 19:07 Moelf

I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram in a proper way (with types), we just never got around to it. Maybe it's really time now.

oschulz avatar Jul 28 '21 19:07 oschulz

I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:

hist1 / hist2

# thread-safe
push!(hist, val, weight)

Moelf avatar Jul 28 '21 21:07 Moelf

It would be really nice to support the thread-safe push! directly in StatsBase.Histogram

oschulz avatar Jul 29 '21 09:07 oschulz

Yes, but I think the idea is this is "Base" histogram. Well, I'm happy to be part of StatsBase for sure since I'm already doing (<: AbstractHistogram) but I don't know what the devs here think.

Moelf avatar Jul 29 '21 11:07 Moelf

More advanced usage can maybe go in a new issue ... I just want my bin counts 🙃

thompsonmj avatar Jul 29 '21 12:07 thompsonmj

See also https://github.com/JuliaStats/StatsBase.jl/pull/533.

I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:

hist1 / hist2

# thread-safe
push!(hist, val, weight)

Who said this? Actually at https://github.com/JuliaStats/StatsBase.jl/issues/650 I said almost the contrary. Improvements to histograms in StatsBase would be welcome (though I admit finding reviewers isn't always easy).

nalimilan avatar Aug 31 '21 08:08 nalimilan

Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)

mkborregaard avatar Aug 31 '21 08:08 mkborregaard

You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).

oschulz avatar Aug 31 '21 14:08 oschulz

If the number of bins is approximated it would be good to document how it is approximated. Even better would be the possibility to select the approximation algorithm. This could then also have the option to get the exact number of bins.

A few options I found are:

  • exact
  • sturges
  • rice
  • scott

feanor12 avatar Oct 13 '21 21:10 feanor12

Yes, that's the kind of methods we have in Plot now and that we should move over to StatsBase (with algorithm structs, not keywords like in Plots).

oschulz avatar Oct 13 '21 21:10 oschulz

Gentle reminder if schedules permit:

@oschulz ...

I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram in a proper way (with types), we just never got around to it. Maybe it's really time now.

@mkborregaard ...

Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)

@oschulz ...

You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).

thompsonmj avatar Jan 17 '22 15:01 thompsonmj

Thanks for the reminder! It's been on the back of my mind, I'll try to get on it soonish.

oschulz avatar Jan 17 '22 15:01 oschulz

This code works now: h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=nbins)).

huixinzhang avatar May 09 '22 20:05 huixinzhang

Hi @huixinzhang, I think the core of the discussion centers on the default response to using the nbins keyword argument. The workaround using range() is useful, but I believe nbins should use the exact number of bins asked for if the code is to speak for itself!

thompsonmj avatar Jun 01 '22 20:06 thompsonmj