StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

Inf and NaN in weights

Open aplavin opened this issue 2 years ago • 7 comments

Currently, weights throws an error whenever it encounters non-finite values: ArgumentError: weights cannot contain Inf or NaN values. That's highly surprising IMO, basically all other operations propagate NaNs and allow Infs – both basic arithmetics and aggregations.

# totally sensible and unambiguous results:
julia> median([1,2,3,Inf])
2.5
julia> mean([1,2,3,Inf])
Inf
julia> mean([1,2,3,Inf], weights([1,1,1,1]))
Inf

julia> mean([1,2,3,NaN])
NaN
julia> median([1,2,3,NaN])
NaN

# and suddenly:
julia> mean([1,2,3,4], weights([1,1,Inf,1]))
ERROR: ArgumentError: weights cannot contain Inf or NaN values
julia> mean([1,2,3,4], weights([1,1,NaN,1]))
ERROR: ArgumentError: weights cannot contain Inf or NaN values

I looked up as found that this error was introduced after the issue https://github.com/JuliaStats/StatsBase.jl/issues/671. I think throwing this exception denies perfectly sensible behavior, so that weighted aggregations worked exactly as their unweighted counterparts – by propagating NaNs. And the throw in weights() should be removed, maybe adding it in specific aggregations where Infs/NaNs really do not make any sense (eg, mean and median should work).

Numpy does propagate, of course:

In [2]: np.average([1,2,3], weights=[1,np.nan,3])
Out[2]: nan

aplavin avatar Nov 29 '23 21:11 aplavin

Bump...

aplavin avatar Dec 27 '23 20:12 aplavin

Gentle bump... I understand that developer time is always scarce, so: would a PR fixing this be accepted?

aplavin avatar Jan 16 '24 01:01 aplavin

I wasn't involved in the previous discussions in #671 and https://github.com/JuliaStats/StatsBase.jl/pull/814, but my impression was that the errors are intentional and the previous behaviour (no errors) was considered a bug. I think I tend to agree with this view since weights are supposed to sum to 1 (or at least it should be able to normalize them to such weights), which is not possible if a weight is NaN or Inf.

devmotion avatar Jan 16 '24 01:01 devmotion

The whole point of NaN is that it propagates, and plain aggregations play nicely with that – mean of an array with NaN is NaN. This is natural and convenient when doing many aggregations, with a few of them having NaNs: they just appear in the correct locations in the results, instead of requiring try/catch or manual isnan checks everywhere. It's surprising and inconvenient that as soon as one needs weighted aggregations, suddenly there's no way around – manual trycatch/isnan checks are required.

aplavin avatar Jan 16 '24 01:01 aplavin

I don't see a clear inconsistency, there's no restriction on the values and Inf/NaN propagate in the same way for weighted operations:

julia> using StatsBase

julia> mean([1,2,3,Inf])
Inf

julia> mean([1,2,3,Inf], weights([1,1,1,1]))
Inf

julia> mean([1,2,3,NaN])
NaN

julia> mean([1,2,3,NaN], weights([1,1,1,1]))
NaN

The restriction only applies to the weights themselves because they have to sum up to 1.

devmotion avatar Jan 16 '24 01:01 devmotion

The restriction only applies to the weights themselves because they have to sum up to 1.

And if they don't because there's a NaN – then the result is... Not a Number :) Exactly the same as with mean([1, NaN]) that already returns NaN because there's no actual number that represents the result. I don't really see any difference in these situations, and the same motivation works in both cases.

Consider two scenarios:

Have an array of arrays, and want to compute mean of each inner array:

mean.(A)

and NaN are correctly propagated.

Have an array of arrays of values, and array of arrays of weights, and want to compute weighted means?

mean.(A, weight.(W))

doesn't work as soon as there are any NaNs inside W, and requires manual handling.

aplavin avatar Jan 16 '24 01:01 aplavin

For those stumbling across the same issue: AbstractWeights and many functions that take them actually support NaNs. Defining another type MyWeights <: AbstractWeights without NaN checks in the constructor will work just fine!

Or, even easier – just overload the Weights type constructor by putting this code anywhere you like:

using StatsBase
@generated Weights{S,T,V}(values, sum) where {S<:Real, T<:Real, V<:AbstractVector{T}} = Expr(:new, Weights{S,T,V}, :values, :sum)

That's what I personally do.

It doesn't change any existing behavior, just adds support for non-finite values in weight vectors. Then:

julia> mean([1,2], weights([1,NaN]))
NaN

aplavin avatar Jan 21 '24 21:01 aplavin