StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

Throw an error for non-integer `FrequencyWeights`

Open ParadaCarleton opened this issue 3 years ago • 2 comments

Currently, it is possible to construct a set of FrequencyWeights with non-integer weights:

julia> FrequencyWeights([.5, .1])
2-element FrequencyWeights{Float64, Float64, Vector{Float64}}:
 0.5
 0.1

However, the definition of frequency weights only allows for integer-valued weights, and currently-implemented methods either assume that frequency weights are integer-valued and sum up to the sample size, or else check themselves that the weights are integer-valued. This creates a lot of duplicate checks and also increases the risk of an error if someone writes a method forgetting to check whether or not frequency weights are all integer-valued. I propose we move the error upstream so that attempting to construct FrequencyWeights with non-integer values throws an error, rather than throwing an error whenever a function is called on invalid weights.

ParadaCarleton avatar Jan 18 '22 17:01 ParadaCarleton

AFAIK all definitions for frequency weights generalize perfectly to non-integer weights. The only place where we check that weights are integers is in quantile, right? See https://github.com/JuliaStats/StatsBase.jl/pull/436.

I guess we could add a check, but throwing an error would be breaking so it has to wait the next breaking release. We could add a check argument though, keeping it to false for now.

nalimilan avatar Jan 23 '22 17:01 nalimilan

AFAIK all definitions for frequency weights generalize perfectly to non-integer weights. The only place where we check that weights are integers is in quantile, right? See #436.

I think that's the only place where we check that, but we implicitly assume the frequency weights are counts (rather than proportions): all code for FrequencyWeights assumes that sum(weights) is equal to the sample size at the moment.

(To be clear, my proposal is to check that all the weights are integers, not that all of them are Int. This allows for things like representing large integers as floats.)

ParadaCarleton avatar Jan 28 '22 01:01 ParadaCarleton