StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

Preserve eltype where possible for moments

Open palday opened this issue 4 years ago • 6 comments

Fixes #680.

palday avatar Apr 23 '21 17:04 palday

The failure on nightly is an upstream bug (https://github.com/JuliaLang/julia/issues/40609).

palday avatar Apr 26 '21 11:04 palday

This looks good but the prior uses of Float64 make me worried that the code may have issues around under/overflow when using other types.

ararslan avatar Apr 26 '21 19:04 ararslan

Ugh, I'm conflicted on that front. Because that would be breaking, but technically a user hitting overflow on their custom narrow types is expected behavior. @nalimilan do we have enough for another breaking pre-1.0 release?

palday avatar Apr 26 '21 19:04 palday

We can break things if we want to tag 1.0. Though ideally I'd rather move these to Statistics, but that work has stalled (https://github.com/JuliaLang/julia/pull/31395).

nalimilan avatar Apr 26 '21 20:04 nalimilan

Okay, so on the breaking / non breaking front:

  1. Did the API ever specify the return type for mean and the various moments? If not, this is not breaking in terms of type behavior.
  2. In terms of numeric behavior, Julia generally keeps things at narrow types even at the risk of over-/underflow. As such, keeping the narrowest type that would be possible for the arithmetic operations in these computations seems fine. In other words, the new return type and potential numeric implications are expected in the Julia ecosystem.
  3. The potential numeric issues with all of the moments here are possible regardless of type width -- we do accumulate then divide without sorting first, so
    • we have issues with computations where things differ in scale and the smaller values accumulate enough to be relevant on the larger scale, but consistently round down when the initial values in the array are large (sorting solves this but is computationally expensive)
    • we can have overflow in the accumulation step (batching the accumulate+divide on subsets solves this but potentially causes performance and numeric issues from the extra divisions)
  4. The way for best numerical accuracy would be sort, do the math in a higher precision and cast down. But this also seems unacceptable -- if a user choose to keep everything in single or half precision, then we should assume there is a reason for that. If they really need that level of numerical stability, then they should probably just implement their own moments.

In other words: given that we're not going to change the algorithm used (which I think is reasonable), the potential numeric issues from the eltype preservation should not be considered breaking. Perhaps the best solution is to add to each of the docstrings:

!!! note
    The computation is done in the narrowest type for which the underlying 
    arithmetic operations are defined based on the element types of the 
    vectors and the mean. For higher numerical accuracy and to reduce the risk
    of overflow when dealing with values near the limits of the type (or many 
    values whose sum is near the limits of the type), we recommend casting to 
    a broader type / higher precision first.

palday avatar Apr 26 '21 21:04 palday

Yeah it's not super likely to break user code. Though we could collect the few changes which are breaking (some of them are marked with the breaking label) and take that occasion to tag 1.0.

nalimilan avatar Apr 27 '21 07:04 nalimilan