StatsBase.jl
StatsBase.jl copied to clipboard
Preserve eltype where possible for moments
Fixes #680.
The failure on nightly is an upstream bug (https://github.com/JuliaLang/julia/issues/40609).
This looks good but the prior uses of Float64 make me worried that the code may have issues around under/overflow when using other types.
Ugh, I'm conflicted on that front. Because that would be breaking, but technically a user hitting overflow on their custom narrow types is expected behavior. @nalimilan do we have enough for another breaking pre-1.0 release?
We can break things if we want to tag 1.0. Though ideally I'd rather move these to Statistics, but that work has stalled (https://github.com/JuliaLang/julia/pull/31395).
Okay, so on the breaking / non breaking front:
- Did the API ever specify the return type for
meanand the various moments? If not, this is not breaking in terms of type behavior. - In terms of numeric behavior, Julia generally keeps things at narrow types even at the risk of over-/underflow. As such, keeping the narrowest type that would be possible for the arithmetic operations in these computations seems fine. In other words, the new return type and potential numeric implications are expected in the Julia ecosystem.
- The potential numeric issues with all of the moments here are possible regardless of type width -- we do accumulate then divide without sorting first, so
- we have issues with computations where things differ in scale and the smaller values accumulate enough to be relevant on the larger scale, but consistently round down when the initial values in the array are large (sorting solves this but is computationally expensive)
- we can have overflow in the accumulation step (batching the accumulate+divide on subsets solves this but potentially causes performance and numeric issues from the extra divisions)
- The way for best numerical accuracy would be sort, do the math in a higher precision and cast down. But this also seems unacceptable -- if a user choose to keep everything in single or half precision, then we should assume there is a reason for that. If they really need that level of numerical stability, then they should probably just implement their own moments.
In other words: given that we're not going to change the algorithm used (which I think is reasonable), the potential numeric issues from the eltype preservation should not be considered breaking. Perhaps the best solution is to add to each of the docstrings:
!!! note
The computation is done in the narrowest type for which the underlying
arithmetic operations are defined based on the element types of the
vectors and the mean. For higher numerical accuracy and to reduce the risk
of overflow when dealing with values near the limits of the type (or many
values whose sum is near the limits of the type), we recommend casting to
a broader type / higher precision first.
Yeah it's not super likely to break user code. Though we could collect the few changes which are breaking (some of them are marked with the breaking label) and take that occasion to tag 1.0.