StatsBase.jl
Improve weights docstrings
AnalyticWeights have a precise definition, on which we rely in several functions.
Also make docstrings more consistent across types.
I think we should make an important note for both AnalyticWeights and FrequencyWeights that the scale of the weights matters, i.e. f(x, weights) will usually not equal f(x, 2 .* weights).
It would also help to be very specific about the following, to avoid people making the mistake of trying to use normalized weights:
For AnalyticWeights, w[i] must be equal to 1/var(x[i]), and var(x[i]) must be known ahead of time.
For FrequencyWeights, w[i] must be equal to the number of observations for x[i].
I believe the docstring for ProbabilityWeights is already clear enough, because probability weights are scale-invariant. (For now; in theory we might want to make a distinction between self-normalized and unnormalized weights in the future, because unnormalized weights can give unbiased estimators, but ATM we only use self-normalized weights).
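To make the scale point concrete, here is a minimal sketch, assuming a StatsBase version where var accepts a weights vector together with the corrected keyword. The data and weights are made up for illustration. It compares results before and after doubling the weights, for frequency and probability weights only (analytic weights are left out because their definition is still under discussion).

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 4.0]
counts = [1, 2, 3]          # hypothetical frequency counts
probs = [0.2, 0.3, 0.5]     # hypothetical probability weights

# Weighted means are unchanged when all weights are doubled.
mean_f  = mean(x, fweights(counts))
mean_f2 = mean(x, fweights(2 .* counts))

# The bias-corrected variance with frequency weights divides by sum(w) - 1,
# so doubling the counts should change the result.
var_f  = var(x, fweights(counts); corrected=true)
var_f2 = var(x, fweights(2 .* counts); corrected=true)

# With probability weights the same rescaling should leave the variance unchanged.
var_p  = var(x, pweights(probs); corrected=true)
var_p2 = var(x, pweights(2 .* probs); corrected=true)

println((mean_f ≈ mean_f2, var_f ≈ var_f2, var_p ≈ var_p2))
```

If the frequency-weight comparison comes out false while the other two are true, that is the behavior a docstring note would need to warn about.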
I've added mentions regarding scale-invariance of frequency and probability weights. For analytic weights I'd rather wait until https://github.com/JuliaStats/StatsBase.jl/issues/758 is settled as we may want to adjust the definition a bit (maybe in the next breaking release). AFAICT currently they are scale-invariant, right?
I've updated the description of analytic weights in the light of https://github.com/JuliaStats/StatsBase.jl/issues/758. Does that sound correct?
I think we should hold off until #758 is resolved.
Oh, brief note that I think could be useful for users -- currently, all our methods for ProbabilityWeights normalize the weights before calculating an estimator. This is probably the best default, but sometimes it's useful to use the Hansen-Hurwitz estimators for means, variances, etc.; these estimators use the unnormalized weights, which makes them unbiased (but usually results in higher variance). The behavior of these estimators can be replicated by setting sum=1, in which case the weights won't be normalized.
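As a rough illustration of the difference described above, the following sketch uses plain arithmetic rather than any StatsBase method. The values and weights are hypothetical. It contrasts a self-normalized weighted mean with the unnormalized form, which treats the total weight as fixed at 1.

```julia
x = [10.0, 20.0, 40.0]
w = [0.2, 0.3, 0.4]   # hypothetical unnormalized weights; they sum to 0.9, not 1

# Self-normalized estimator: divide by the observed sum of the weights.
self_normalized = sum(w .* x) / sum(w)

# Unnormalized estimator: use the weights as given, as if their sum were 1.
unnormalized = sum(w .* x)

println((self_normalized, unnormalized))
```

The two estimates coincide only when the weights happen to sum to exactly 1.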
@ParadaCarleton Don't you think that this PR is a strict improvement over the current situation, even if we decide to split AnalyticWeights into several types?
I think it's an improvement, yeah, but I'd clarify that the weights:
- Refer specifically to sample sizes for each observation.
- I'd add a warning about std doing something different from what you'd expect.