Hmisc icon indicating copy to clipboard operation
Hmisc copied to clipboard

wtd.quantile

Open Stefan2015-5 opened this issue 6 years ago • 13 comments

I noticed a very surprising result of the wtd.quantile function. If the weights are not integers, there seems to be an issue:

> require(Hmisc) > data <- c(1,2,3) > weights<- c(0.1, 0.1, 0.1) > wtd.quantile(data, weights=weights, probs=0.5) 50% 3 > weights<- c(0.1, 0.1, 0.1)* 10 > wtd.quantile(data, weights=weights, probs=0.5) 50% 2

Stefan2015-5 avatar Oct 11 '18 11:10 Stefan2015-5

This is definitely an error. Code fix welcomed.

harrelfe avatar Oct 14 '18 15:10 harrelfe

Checking the function, I see it is assuming weights are integers, like nr of observations. I saw the function that is currently implemented can not handle (i) any non-integer numbers in weights correctly and (ii) not remove NAs despite the na.rm parameter. I think the easiest solution is to scale the weights up, a quick fix is here. I marked new lines with ##### new


 wtd.quantile<- function (x, weights = NULL, probs = c(0, 0.25, 0.5, 0.75, 1), 
           type = c("quantile", "(i-1)/(n-1)", "i/(n+1)", "i/n"), normwt = FALSE, 
           na.rm = TRUE)  {
   if (!length(weights))      return(quantile(x, probs = probs, na.rm = na.rm))
   type <- match.arg(type)
   if (any(probs < 0 | probs > 1))      stop("Probabilities must be between 0 and 1 inclusive")
   nams <- paste(format(round(probs * 100, if (length(probs) > 
                                               1) 2 - log10(diff(range(probs))) else 2)), "%", sep = "")
   
   if(na.rm & any(is.na(weights))){   ###### new
     i<- is.na(weights)
     x <- x[!i]
     weights <- weights[!i]
   }
   i <- weights == 0
   if (any(i)) {
     x <- x[!i]
     weights <- weights[!i]
   }
   if (type == "quantile") {
     if(sum(weights) < 1000000 ) {weights<- weights*1000000/sum(weights)}  ##### new 
     w <- wtd.table(x, weights, na.rm = na.rm, normwt = normwt, 
                    type = "list")
     x <- w$x
     wts <- w$sum.of.weights
     n <- sum(wts)
     order <- 1 + (n - 1) * probs
     low <- pmax(floor(order), 1)
     high <- pmin(low + 1, n)
     order <- order%%1
     allq <- approx(cumsum(wts), x, xout = c(low, high), method = "constant", 
                    f = 1, rule = 2)$y
     k <- length(probs)
     quantiles <- (1 - order) * allq[1:k] + order * allq[-(1:k)]
     names(quantiles) <- nams
     return(quantiles)
   }
   w <- wtd.Ecdf(x, weights, na.rm = na.rm, type = type, normwt = normwt)
   structure(approx(w$ecdf, w$x, xout = probs, rule = 2)$y, 
             names = nams)
 }
 

Stefan2015-5 avatar Oct 18 '18 17:10 Stefan2015-5

I apologize for not studying those more before my first reply. I think that in your original example you need to specify normwt=TRUE when weights do not add up to n. When you do that you get the proper median of 2.0. wtd.table handles fractional weights properly, so I don't think your changes are needed other than the more accurate way of dealing with na.rm.

harrelfe avatar Oct 18 '18 18:10 harrelfe

No, even with normwt=TRUE the function produces incorrect results. (On top of that I can't imagine a useful interpretation of the result when normwt=FALSE)

 require(Hmisc)
 data <- 1:3
 weights<- c(0.1, 0.1, 0.8)
 probs=0.09
 
 wtd.quantile(x=data, weights=weights, probs=probs, normwt=TRUE) #=3 ; incorrect. 
 wtd.quantile2(x=data, weights=weights, probs=probs)  #=1; correct. this uses the modified function I posted above. 
 

The issue is the one I outlined above. I suggest to drop the parameter normwt, or keep it just for backward compatability without use.

Stefan2015-5 avatar Oct 19 '18 11:10 Stefan2015-5

normwt=FALSE is very seldom used but there is a legitimate use when weights represent actual case weights, i.e., an observation with a weight of 3 actually was sampled 3 times.

If wtd.quantile fails, then either wtd.table or wtd.ecdf must be the culprit. Scaling weights up is not the ultimate solution and is not scalable to massive problems.

harrelfe avatar Oct 20 '18 13:10 harrelfe

If normwt=FALSE the code produces the very same results as normwt=TRUE except in some cases where either normwt=FALSE fails or both fail. To convince me otherwise please provide a counter example. But even if: "very seldom" is already making it clear: it should certainly not be the standard value.

"If wtd.quantile fails, then either wtd.table or wtd.ecdf must be the culprit" What makes you think that? From what I can see wtd.table behaves just fine and wtd.ecdf is not even called in the incorrect examples. Again: the calcs of the inside function objects order, low and high are assuming weights to be positive integers (under normwt=FALSE and normwt= TRUE). I agree that "scaling up is not the ultimate solution" but if you do not want the whole function changed, it is the only I am seeing.

BTW: the function also does not handle negative weights (neither gives a warning, error or provides a meaningful result). I think the best would be to drop these together with zero weights.

Stefan2015-5 avatar Oct 22 '18 07:10 Stefan2015-5

I agree re: negative weights. And I agree in principle about changing the default for normwt, but it's too late to do that now.

wtd.table nowhere assumes integer weights, as 'frequencies' are sums of weights. I suspect the problem is more related to the rules that are used for computing quantiles when the quantile being requested is a a value where there are ties in the data, or the wtd.table method is not interpolated correctly to give the requested quantile.

harrelfe avatar Oct 22 '18 18:10 harrelfe

Any update on this?

I find the following case to have unexpected result.

>>> x <- c(1,2)
>>> weights <- c(1,9)
>>> wtd.quantile(x, weights=weights, normwt=TRUE, prob=0)
0%
2

I believe I can help fixing.

iii-org-tw avatar Jul 21 '20 10:07 iii-org-tw

The functions are broken and the maintainer undiscerning (see thread). To me the package is dead.  Use my functions instead, where I fixed the issues. 

wtd.quantile<- function (x, weights = NULL, probs = c(0, 0.25, 0.5, 0.75, 1),
                         type = c("quantile", "(i-1)/(n-1)", "i/(n+1)", "i/n"),
                         na.rm = TRUE)  {
  # Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
  normwt = FALSE
  if (!length(weights))      return(quantile(x, probs = probs, na.rm = na.rm))
  type <- match.arg(type)
  if (any(probs < 0 | probs > 1))      stop("Probabilities must be between 0 and 1 inclusive")
  nams <- paste(format(round(probs * 100, if (length(probs) >
                                              1) 2 - log10(diff(range(probs))) else 2)), "%", sep = "")

  if(na.rm & any(is.na(weights))){   ###### new
    i<- is.na(weights)
    x <- x[!i]
    weights <- weights[!i]
  }
  i <- weights <= 0         # nwe: kill negative and zero weights and associated data
  if (any(i)) {
    x <- x[!i]
    weights <- weights[!i]
  }
  if (type == "quantile") {
    if(sum(weights) < 1000000 ) {weights<- weights*1000000/sum(weights)}  ##### new
    w <- wtd.table(x, weights, na.rm = na.rm, normwt = normwt,
                   type = "list")
    x <- w$x
    wts <- w$sum.of.weights
    n <- sum(wts)
    order <- 1 + (n - 1) * probs
    low <- pmax(floor(order), 1)
    high <- pmin(low + 1, n)
    order <- order%%1
    allq <- approx(cumsum(wts), x, xout = c(low, high), method = "constant",
                   f = 1, rule = 2)$y
    k <- length(probs)
    quantiles <- (1 - order) * allq[1:k] + order * allq[-(1:k)]
    names(quantiles) <- nams
    return(quantiles)
  }
  w <- wtd.Ecdf(x, weights, na.rm = na.rm, type = type, normwt = normwt)
  structure(approx(w$ecdf, w$x, xout = probs, rule = 2)$y,
            names = nams)
}




wtd.table<- function (x, weights = NULL, type = c("list", "table"), normwt = FALSE,
                      na.rm = TRUE) {
  # Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
  p_load(timeDate)
  type <- match.arg(type)
  if (!length(weights))
    weights <- rep(1, length(x))
  isdate <- ( class(x)[1]=="Date"  | class(x)[1]=="POSIXct") ### MODIFIED
  ax <- attributes(x)
  ax$names <- NULL
  if (is.character(x))
    x <- as.factor(x)
  lev <- levels(x)
  x <- unclass(x)
  if (na.rm) {
    s <- !is.na(x + weights)
    x <- x[s, drop = FALSE]
    weights <- weights[s]
  }
  n <- length(x)
  if (normwt)
    weights <- weights * length(x)/sum(weights)
  i <- order(x)
  x <- x[i]
  weights <- weights[i]
  if (anyDuplicated(x)) {
    weights <- tapply(weights, x, sum)
    if (length(lev)) {
      levused <- lev[sort(unique(x))]
      if ((length(weights) > length(levused)) && any(is.na(weights)))
        weights <- weights[!is.na(weights)]
      if (length(weights) != length(levused))
        stop("program logic error")
      names(weights) <- levused
    }
    if (!length(names(weights)))
      stop("program logic error")
    if (type == "table")
      return(weights)
    #x <-  all.is.numeric(names(weights), "vector")#  modified: commented out. function checked whether all are numeric and if yes returned the weights
    if (isdate)      attributes(x) <- c(attributes(x), ax)
    x_out<- as.numeric(names(weights))
    names(weights) <- NULL
    return(list(x = x_out, sum.of.weights = weights))
  }
  xx <- x
  if (isdate)
    attributes(xx) <- c(attributes(xx), ax)
  if (type == "list")
    list(x = if (length(lev)) lev[x] else xx, sum.of.weights = weights)
  else {
    names(weights) <- if (length(lev))
      lev[x]
    else xx
    weights
  }
}

wtd.mean <- function (x, weights = NULL, normwt = "ignored", na.rm = TRUE) {
  # Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
  if (!length(weights))
    return(mean(x, na.rm = na.rm))
  if (na.rm) {
    s <- !is.na(x + weights)
    x <- x[s]
    weights <- weights[s]
  }
  sum(weights * x)/sum(weights)
}

Stefan2015-5 avatar Jul 21 '20 11:07 Stefan2015-5

@harrelfe Has this issue been addressed in the latest version of Hmisc? The wtd.quantile function is pretty important to the work we do. Thanks for this package and rms!

dylanbeaudette avatar Oct 15 '20 04:10 dylanbeaudette

I'm using Hmisc to generate weighted boxplots with specific quantiles (.05, .25,.5,.75,.95), and I can't get it to to match with the output of STATA's fweight feature. Do you have any suggestions for creating a comparable function?

lrestrepo0001 avatar Nov 09 '20 21:11 lrestrepo0001

@Stefan2015-5 @harrelfe Hello :) any news about this?

Paul-SO avatar Nov 24 '20 13:11 Paul-SO

as already stated: "The functions are broken and the maintainer undiscerning (see thread). To me the package is dead." You are welcome to use my functions instead, where I fixed the issues.

Stefan2015-5 avatar Nov 24 '20 14:11 Stefan2015-5