Distributions.jl icon indicating copy to clipboard operation
Distributions.jl copied to clipboard

`suffstats` integer overflow

Open scoopxyz opened this issue 1 year ago • 1 comments

Hello,

When attempting to do some distribution fitting on integer data, I ran into overflow when using suffstats and fit, my use case made me notice it for a Normal distribution but its likely to show up elsewhere.

This can be easily reproduced via:

using Distributions

samples = UInt16.(round.(randn(1_000_000) .* 100 .+ 1000));
ss = suffstats(Normal, samples)
fit_mle(Normal, ss)

yielding...

Normal{Float64}(μ=0.031291, σ=1004.9262791...)

This is due to the type stability/enforcement used in the summation here: https://github.com/JuliaStats/Distributions.jl/blob/c1705a3015d438f7e841e82ef5148224813831e8/src/univariate/continuous/normal.jl#L134-L137

The method accepts all subtypes of Real so Integers fall into the "expected" use-case, but it would clearly be fragile with smaller integer types. I can imagine a few scenarios to deal with this:

  1. Leave as-is and document the potential issue, maybe suggest against Integer types
  2. Specialize on Integer eltype and accumlate into an Int64 variable, possibly error on overflow
  3. utilize sum from Base rather than a manual loop

All values are cast to Float64 in the returned struct, so I'm not seeing the necessity for enforcing type stability in this case.

scoopxyz avatar Feb 22 '24 23:02 scoopxyz

It's very old code. I think we should just use sum here and when compute the second central moment below. It would be great if you could prepare a PR with your example as a test.

andreasnoack avatar Feb 23 '24 07:02 andreasnoack