StatsBase.jl

Improve weights docstrings

Open nalimilan opened this issue 3 years ago • 7 comments

AnalyticWeights have a precise definition, on which we rely in several functions. Also make docstrings more consistent across types.

nalimilan, Jan 18 '22 21:01

I think we should make an important note for both AnalyticWeights and FrequencyWeights that the scale of the weights matters, i.e. f(x, weights) will usually not equal f(x, 2 .* weights).

It would also help to be very specific about the following, to avoid people making the mistake of trying to use normalized weights: For AnalyticWeights, w[i] must be equal to 1/var(x[i]), and var(x[i]) must be known ahead of time. For FrequencyWeights, w[i] must be equal to the number of observations for x[i].

I believe the docstring for ProbabilityWeights is already clear enough, because probability weights are scale-invariant. (For now; in theory we might want to make a distinction between self-normalized and unnormalized weights in the future, because unnormalized weights can give unbiased estimators, but ATM we only use self-normalized weights).
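To make the scale point concrete, here is a minimal sketch in plain Python (not StatsBase code; the function names are hypothetical). It assumes frequency weights are counts and that the corrected variance divides by `sum(w) - 1`. Under that assumption the weighted mean is unchanged when the weights are doubled, but the variance is not, and only the true counts reproduce the variance of the expanded data.

```python
from statistics import variance

def weighted_mean(x, w):
    """Weighted mean; unchanged if every weight is multiplied by a constant."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)

def freq_weighted_var(x, w):
    """Corrected variance, treating w[i] as the number of times x[i] was observed.

    Assumption for this sketch: the correction divides by sum(w) - 1.
    """
    m = weighted_mean(x, w)
    ss = sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w))
    return ss / (sum(w) - 1)

x = [1.0, 2.0, 4.0]
w = [1, 2, 3]                      # true counts
w2 = [2 * wi for wi in w]          # same weights, rescaled by 2

# Expand the data according to the counts, for comparison.
expanded = [xi for xi, wi in zip(x, w) for _ in range(wi)]

v1 = freq_weighted_var(x, w)       # matches variance(expanded)
v2 = freq_weighted_var(x, w2)      # differs: the denominator is now 2*sum(w) - 1

print(weighted_mean(x, w) == weighted_mean(x, w2))
print(v1, v2, variance(expanded))
```

Normalizing the weights to sum to 1 would be worse still under this formula, since the denominator `sum(w) - 1` becomes zero.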

ParadaCarleton, Jan 18 '22 22:01

> I think we should make an important note for both AnalyticWeights and FrequencyWeights that the scale of the weights matters, i.e. f(x, weights) will usually not equal f(x, 2 .* weights).
>
> It would also help to be very specific about the following, to avoid people making the mistake of trying to use normalized weights: For AnalyticWeights, w[i] must be equal to 1/var(x[i]), and var(x[i]) must be known ahead of time. For FrequencyWeights, w[i] must be equal to the number of observations for x[i].

I've added mentions regarding scale-invariance of frequency and probability weights. For analytic weights I'd rather wait until https://github.com/JuliaStats/StatsBase.jl/issues/758 is settled as we may want to adjust the definition a bit (maybe in the next breaking release). AFAICT currently they are scale-invariant, right?
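A quick way to check the scale-invariance question numerically is sketched below in plain Python (not StatsBase code; `analytic_weighted_var` is a hypothetical name). It assumes the corrected variance for analytic weights uses the reliability-weights denominator `sum(w) - sum(w.^2) / sum(w)`; if the implementation uses a different correction, the conclusion may not carry over.

```python
def analytic_weighted_var(x, w):
    """Corrected weighted variance with a reliability-weights correction.

    Assumption for this sketch: the denominator is sum(w) - sum(w^2) / sum(w).
    Numerator and denominator both scale linearly in w, so the ratio does not
    depend on the overall scale of the weights.
    """
    sw = sum(w)
    m = sum(wi * xi for xi, wi in zip(x, w)) / sw
    ss = sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w))
    return ss / (sw - sum(wi * wi for wi in w) / sw)

x = [1.0, 2.0, 4.0]
w = [0.5, 1.0, 1.5]
w_scaled = [2.0 * wi for wi in w]

va = analytic_weighted_var(x, w)
vb = analytic_weighted_var(x, w_scaled)
print(va, vb)
```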

nalimilan, Jan 20 '22 08:01

I've updated the description of analytic weights in light of https://github.com/JuliaStats/StatsBase.jl/issues/758. Does that sound correct?

nalimilan, Jan 23 '22 21:01

> I've updated the description of analytic weights in the light of #758. Does that sound correct?

I think we should hold off until #758 is resolved.

ParadaCarleton, Jan 24 '22 01:01

Oh, brief note that I think could be useful for users -- currently, all our methods for ProbabilityWeights normalize the weights before calculating an estimator. This is probably the best default, but sometimes it's useful to use the Hansen-Hurwitz estimators for means, variances, etc.; these estimators use the unnormalized weights, which makes them unbiased (but usually results in higher variance). The behavior of these estimators can be replicated by setting sum=1, in which case the weights won't be normalized.
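The difference between the two estimators can be sketched in plain Python (not StatsBase code; all names and numbers here are made up for illustration). The sketch assumes sampling with replacement, known draw probabilities `p[i]`, and a known population size `N`, and it enumerates every possible sample of size 2 so that the expectations are exact rather than simulated.

```python
from itertools import product

# Hypothetical toy population and draw probabilities (probabilities sum to 1).
y = [1.0, 2.0, 6.0]
p = [0.2, 0.3, 0.5]
N = len(y)   # population size, assumed known
n = 2        # number of draws, with replacement

def hansen_hurwitz_mean(sample):
    """Unnormalized estimator: mean of y[i] / p[i] over the draws, divided by N."""
    return sum(y[i] / p[i] for i in sample) / (N * len(sample))

def self_normalized_mean(sample):
    """Ratio estimator: weights 1 / p[i] are normalized by their own sum."""
    w = [1.0 / p[i] for i in sample]
    return sum(wi * y[i] for wi, i in zip(w, sample)) / sum(w)

def expectation(estimator):
    """Exact expectation over all ordered samples of size n."""
    total = 0.0
    for sample in product(range(N), repeat=n):
        prob = 1.0
        for i in sample:
            prob *= p[i]
        total += prob * estimator(sample)
    return total

true_mean = sum(y) / N
e_hh = expectation(hansen_hurwitz_mean)    # equals true_mean up to rounding
e_sn = expectation(self_normalized_mean)   # differs from true_mean here
print(true_mean, e_hh, e_sn)
```

In this toy setup the unnormalized estimator is exactly unbiased while the self-normalized one is not. The sketch says nothing about how either would be selected through a weights type.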

ParadaCarleton, Jan 27 '22 02:01

@ParadaCarleton Don't you think that this PR is a strict improvement over the current situation, even if we decide to split AnalyticWeights into several types?

nalimilan, Mar 20 '22 16:03

I think it's an improvement, yeah, but I'd make two changes:

  1. Clarify that the weights refer specifically to sample sizes for each observation.
  2. Add a warning about std doing something different from what you'd expect.

ParadaCarleton, Mar 21 '22 02:03