
wtd.var: normwt and documentation

tyner opened this issue 9 years ago · 25 comments

I noticed one of the changes in version 3.14-6 (2014-11-16):

wtd.var: corrected denominator. Thanks: Shan Huang

While I agree that the previous denominator was incorrect, I would add that the new denominator causes the return value to no longer depend on the value of the 'normwt' argument (which can be proved with a little algebra).
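This is easy to verify numerically. A quick sketch (Python used here just to mirror the algebra; the package code is R), under the assumption that the corrected denominator is the scale-invariant form sum(w) - sum(w^2)/sum(w): rescaling the weights, which is all that normwt=TRUE does, then cancels out of the ratio.

```python
def wtd_var(x, w):
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    num = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    den = sw - sum(wi ** 2 for wi in w) / sw   # scale-invariant denominator
    return num / den

x = [1.0, 3.0, 5.0, 6.0]
w = [0.5, 0.5, 0.5, 1.5]
w_norm = [wi * len(x) / sum(w) for wi in w]    # normalized to sum to n

print(wtd_var(x, w))       # 5.375
print(wtd_var(x, w_norm))  # 5.375 -- identical, so normalization has no effect
```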

On the other hand, the documentation for this function suggests that 'normwt' is supposed to have an effect:

When normwt=FALSE the weighted variance will not equal the unweighted variance even if the weights are identical. That is because of the subtraction of 1 from the sum of the weights in the denominator of the variance formula. If you want the weighted variance to equal the unweighted variance when weights do not vary, use normwt=TRUE.

So, does the documentation need to be updated? Or perhaps the change to the denominator was too heavy-handed?
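For concreteness, the documented effect is easy to reproduce numerically (a Python sketch of the formula, not the package code): with constant weights, the denominator sum(w) - 1 differs from n - 1, so the weighted and unweighted variances disagree.

```python
def var_unweighted(x):
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

def wtd_var_raw(x, w):
    # normwt=FALSE behaviour as documented: denominator is sum(w) - 1
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)) / (sw - 1)

x = [1.0, 3.0, 5.0, 6.0]
w = [2.0] * 4                  # identical weights

print(var_unweighted(x))       # 4.9166...
print(wtd_var_raw(x, w))       # 4.2142... -- denominator is 7, not 3
```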

tyner avatar Mar 20 '15 14:03 tyner

I've lost track of this. Would you mind delving into it and offering a patch? Thanks.

harrelfe avatar Apr 16 '15 12:04 harrelfe

In my opinion, normalization of weights is irrelevant here, and so the documentation needs updating.

However, we might want to offer another argument to control whether the MLE or the unbiased estimator is computed. The current functionality corresponds to the latter. If you are interested in the MLE, I'm happy to send code for that.

tyner avatar Apr 16 '15 22:04 tyner

I would like to post a new version of Hmisc to CRAN by about 2015-04-22, so if you want to take on the job of enhancing the code and correcting the documentation, that would be fantastic. And please add yourself to the authorship in the help file.

harrelfe avatar Apr 18 '15 12:04 harrelfe

After thinking about this some more, I think the documentation is largely correct, but I would add a clarifying statement along the lines of:

If 'weights' are frequency weights, then 'normwt' should be FALSE, and if 'weights' are normalization (aka reliability) weights, then 'normwt' should be TRUE. In the case of the former, no check is made that 'weights' are valid frequencies.

I have attached a proposal for the new code (apologies for not supplying an actual patch; this svn user would someday like to get educated on git). Code follows:

wtd.var <- function(x, weights = NULL, normwt = FALSE, na.rm = TRUE,
                    method = c("unbiased", "ML"))
{
  ## No weights supplied: fall back to the ordinary variance
  if (!length(weights)) {
    if (na.rm) x <- x[!is.na(x)]
    return(var(x))
  }

  ## Drop observations where either the value or its weight is missing
  if (na.rm) {
    s <- !is.na(x + weights)
    x <- x[s]
    weights <- weights[s]
  }

  method <- match.arg(method)
  if (method == "ML") {
     return(as.numeric(stats::cov.wt(cbind(x), weights, method = "ML")$cov))
  } else {
     sw   <- sum(weights)
     xbar <- sum(weights * x) / sw
     num  <- sum(weights * ((x - xbar)^2))

     ## normwt = TRUE: reliability-weight denominator (invariant to rescaling);
     ## normwt = FALSE: frequency-weight denominator sum(w) - 1
     den <- sw - if (normwt) sum(weights^2) / sw else 1

     return(num / den)
  }
}

As you can see, it adds a new argument 'method' corresponding to that of stats::cov.wt, for which the documentation could be:

The 'method' argument determines the estimator type: if 'unbiased' (the default), the usual unbiased estimate (using Bessel's correction) is returned; if 'ML', the maximum likelihood estimate for a Gaussian distribution is returned. In the case of the latter, the 'normwt' argument has no effect.

tyner avatar Apr 19 '15 19:04 tyner

Excellent. Thanks very much. I incorporated all that and added you as an author in the help file. If you want me to include an affiliation and/or an e-mail address, please let me know what to put (by private email if you want: [email protected]). This will be in the upcoming release to CRAN in several days. I made a couple of small changes to your code. Please check the function if you have time. I ran one test:

x=c(1,3,5,6)
wtd.var(x)
[1] 4.916667
w=c(1,1,1,3)
wtd.var(x, w, method='un')
[1] 4.3
wtd.var(x, w, method='ML')
[1] 3.583333
var(c(1,3,5,6,6,6))
[1] 4.3
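The same numbers fall out of the formulas directly (a Python cross-check of the transcript above, using the proposed code's formulas with normwt=FALSE: num / (sum(w) - 1) for "unbiased", num / sum(w) for "ML"):

```python
def wtd_var(x, w, method="unbiased"):
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    num = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / (sw if method == "ML" else sw - 1)

x = [1, 3, 5, 6]
w = [1, 1, 1, 3]
print(wtd_var(x, w))        # 4.3 -- matches var(c(1,3,5,6,6,6))
print(wtd_var(x, w, "ML"))  # 3.5833...
```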

harrelfe avatar Apr 20 '15 02:04 harrelfe

I took a look ... don't you need to add a formal 'method' argument to the function's argument list?

Personally I don't see a need to actually normalize the weights when normwt=TRUE.

It's fine to include my email ([email protected]) in there.

By the way, the Wikipedia article cited has an error... I'll try to remember to fix that later.

tyner avatar Apr 20 '15 02:04 tyner

Sorry the first version I posted forgot to add the method argument. It's corrected now. If you want to make any other changes to code or documentation, let me know. In the future feel free to check out the package and make changes that you post for me to merge with the main trunk.

harrelfe avatar Apr 20 '15 12:04 harrelfe

Thanks, it's working now. I also took a stab at correcting the Wikipedia article.

tyner avatar Apr 20 '15 13:04 tyner

I wonder whether there is still an inconsistency in the documentation. The description says:

weights can also be sampling weights, in which case setting normwt to TRUE will often be appropriate. This results in making weights sum to the length of the non-missing elements in x.

Yet the description of the normwt argument now says:

if weights are normalization (aka reliability) weights, then normwt should be TRUE

So when normwt=TRUE, weights can be either sampling or reliability weights? It doesn't seem to me that these can be considered as equivalent. And indeed svyvar from the survey package (which uses sampling weights) gives different results. Should the documentation be changed?

nalimilan avatar Apr 26 '17 09:04 nalimilan

I agree that sampling and reliability weights are not equivalent, and I am dubious that normwt=TRUE would ever be appropriate for sampling weights.

tyner avatar Apr 27 '17 11:04 tyner

I think normwt=TRUE would be appropriate only if someone forgot to normalize the sampling weights to add to 1.

harrelfe avatar Apr 27 '17 16:04 harrelfe

To me, the normwt argument is only for specifying the kind of weights. In many cases, sampling weights are realizations of random variables, in which case I don't think they should be expected to sum to a predetermined value.

tyner avatar Apr 27 '17 19:04 tyner

Sampling weights do not in general sum to 1. Actually, one of the possible normalizations is to have them sum to the number of observations, which gives correct inference for some applications (like regression).

But really the issue isn't what the sum is equal to, it's that the formula is quite different for sampling weights. See my summary of wtd.var vs. svyvar. It seems to me that wtd.var shouldn't mention sampling weights at all, or maybe better mention explicitly that the correction is not valid for them.
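To make the difference concrete, here is a Python sketch contrasting the frequency-weight correction with one common design-based form for sampling weights. This is an illustration under stated assumptions only: svyvar's actual computation involves more than this, so the second function should not be read as the survey package's formula.

```python
def freq_var(x, w):
    # frequency weights: denominator sum(w) - 1
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)) / (sw - 1)

def sampling_var(x, w):
    # sampling weights: n/(n-1) times the weighted mean squared deviation
    # (one common design-based form; an assumption here, not svyvar's code)
    n, sw = len(x), sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return n / (n - 1) * sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)) / sw

x = [1.0, 2.0, 4.0, 8.0]
w = [1.0, 2.0, 3.0, 4.0]
print(freq_var(x, w))      # 8.1
print(sampling_var(x, w))  # 9.72 -- a different correction entirely
```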

nalimilan avatar Apr 28 '17 07:04 nalimilan

That link doesn't work for me, but I see nothing wrong with the existing treatment of sampling weights:

set.seed(6860)
x <- runif(100)
w <- sample(1:10, size=100, replace=TRUE)
stats::var(rep(x, w))
[1] 0.08376841
wtd.var(x, w, normwt = FALSE)
[1] 0.08376841
wtd.var(rep(x, w), normwt = FALSE)
[1] 0.08376841

tyner avatar Apr 28 '17 10:04 tyner

Sorry, I've just fixed the link.

The example in your last comment illustrates frequency/case/repeat weights, not sampling weights. For a presentation of the difference, see e.g. this link. The variance using sampling weights is:

set.seed(6860)
x <- runif(100)
w <- sample(1:10, size=100, replace=TRUE)
library(survey)
d <- svydesign(ids=~1, weights=~w, data=data.frame(x, w))
svyvar(~x, d)
  variance     SE
x 0.084468 0.0077

nalimilan avatar Apr 28 '17 14:04 nalimilan

My mistake; was not aware of the distinction between frequency and sampling weights. I must retract my statement about sampling weights being realizations of a random variable; what I had in mind there was in fact frequency weights.

tyner avatar Apr 28 '17 16:04 tyner

The help page says "These functions compute various weighted versions of standard estimators." Therefore, for wtd.var, the goal is to estimate a population variance. However, if the sampling weights (probabilities) are known exactly, then isn't the variance also known exactly, making estimation moot?

tyner avatar Apr 30 '17 12:04 tyner

No, you can know the sampling probabilities without knowing anything about the variable you are analyzing.

harrelfe avatar Apr 30 '17 15:04 harrelfe

Good point Frank. Is sampling probability also known as inclusion probability?

https://en.wikipedia.org/wiki/Inclusion_probability

tyner avatar May 01 '17 23:05 tyner

I guess so. That's not as standard a term for it.

Frank Harrell Chairman, Department of Biostatistics


harrelfe avatar May 01 '17 23:05 harrelfe

So that takes us into the realm of stratified sampling. I suppose we could refine wtd.var to accept inverse probability weights... would it also be desirable to handle second-order inclusion probabilities?

tyner avatar May 03 '17 01:05 tyner

I don't think wtd.var should take care of multi-stage sampling, for which svyvar is very good. It could make sense to support inverse probability weights (a.k.a. sampling weights), but that's not a requirement either: what matters most is that the documentation is very clear about it.

nalimilan avatar May 03 '17 09:05 nalimilan

By the way, do you know about other software that implements the same correction as wtd.var for analytic/reliability weights (normwt=TRUE)? I looked at SAS and Stata and I couldn't find a way to compute it.

nalimilan avatar May 03 '17 10:05 nalimilan

Agreed that multi-stage is beyond the scope here. By second-order inclusion probabilities, I mean the pairwise probabilities of inclusion for a single-stage design.

For other software: did you already try the 'aweight' functionality in Stata?

tyner avatar May 04 '17 02:05 tyner

Agreed that multi-stage is beyond the scope here. By second-order inclusion probabilities, I mean the pairwise probabilities of inclusion for a single-stage design.

Ah, sorry. I would say that's also out of scope, as that's a much more complex scenario than what wtd.var handles currently.

For other software: did you already try the 'aweight' functionality in Stata?

Yes, but for obscure reasons it doesn't use the right correction.

nalimilan avatar May 04 '17 08:05 nalimilan