Hmisc
Hmisc copied to clipboard
wtd.quantile
I noticed a very surprising result of the wtd.quantile function. If the weights are not integers, there seems to be an issue:
> require(Hmisc)
> data <- c(1,2,3)
> weights<- c(0.1, 0.1, 0.1)
> wtd.quantile(data, weights=weights, probs=0.5)
50%
3
> weights<- c(0.1, 0.1, 0.1)* 10
> wtd.quantile(data, weights=weights, probs=0.5)
50%
2
This is definitely an error. Code fix welcomed.
Checking the function, I see it is assuming weights are integers, like nr of observations. I saw the function that is currently implemented can not handle (i) any non-integer numbers in weights correctly and (ii) not remove NAs despite the na.rm
parameter. I think the easiest solution is to scale the weights up, a quick fix is here.
I marked new lines with ##### new
wtd.quantile<- function (x, weights = NULL, probs = c(0, 0.25, 0.5, 0.75, 1),
type = c("quantile", "(i-1)/(n-1)", "i/(n+1)", "i/n"), normwt = FALSE,
na.rm = TRUE) {
if (!length(weights)) return(quantile(x, probs = probs, na.rm = na.rm))
type <- match.arg(type)
if (any(probs < 0 | probs > 1)) stop("Probabilities must be between 0 and 1 inclusive")
nams <- paste(format(round(probs * 100, if (length(probs) >
1) 2 - log10(diff(range(probs))) else 2)), "%", sep = "")
if(na.rm & any(is.na(weights))){ ###### new
i<- is.na(weights)
x <- x[!i]
weights <- weights[!i]
}
i <- weights == 0
if (any(i)) {
x <- x[!i]
weights <- weights[!i]
}
if (type == "quantile") {
if(sum(weights) < 1000000 ) {weights<- weights*1000000/sum(weights)} ##### new
w <- wtd.table(x, weights, na.rm = na.rm, normwt = normwt,
type = "list")
x <- w$x
wts <- w$sum.of.weights
n <- sum(wts)
order <- 1 + (n - 1) * probs
low <- pmax(floor(order), 1)
high <- pmin(low + 1, n)
order <- order%%1
allq <- approx(cumsum(wts), x, xout = c(low, high), method = "constant",
f = 1, rule = 2)$y
k <- length(probs)
quantiles <- (1 - order) * allq[1:k] + order * allq[-(1:k)]
names(quantiles) <- nams
return(quantiles)
}
w <- wtd.Ecdf(x, weights, na.rm = na.rm, type = type, normwt = normwt)
structure(approx(w$ecdf, w$x, xout = probs, rule = 2)$y,
names = nams)
}
I apologize for not studying those more before my first reply. I think that in your original example you need to specify normwt=TRUE when weights do not add up to n. When you do that you get the proper median of 2.0. wtd.table handles fractional weights properly, so I don't think your changes are needed other than the more accurate way of dealing with na.rm.
No, even with normwt=TRUE
the function produces incorrect results. (On top of that I can't imagine a useful interpretation of the result when normwt=FALSE
)
require(Hmisc)
data <- 1:3
weights<- c(0.1, 0.1, 0.8)
probs=0.09
wtd.quantile(x=data, weights=weights, probs=probs, normwt=TRUE) #=3 ; incorrect.
wtd.quantile2(x=data, weights=weights, probs=probs) #=1; correct. this uses the modified function I posted above.
The issue is the one I outlined above. I suggest to drop the parameter normwt, or keep it just for backward compatability without use.
normwt=FALSE
is very seldom used but there is a legitimate use when weights represent actual case weights, i.e., an observation with a weight of 3 actually was sampled 3 times.
If wtd.quantile
fails, then either wtd.table
or wtd.ecdf
must be the culprit. Scaling weights up is not the ultimate solution and is not scalable to massive problems.
If normwt=FALSE
the code produces the very same results as normwt=TRUE
except in some cases where either normwt=FALSE
fails or both fail. To convince me otherwise please provide a counter example. But even if: "very seldom" is already making it clear: it should certainly not be the standard value.
"If wtd.quantile
fails, then either wtd.table
or wtd.ecdf
must be the culprit"
What makes you think that? From what I can see wtd.table
behaves just fine and wtd.ecdf
is not even called in the incorrect examples. Again: the calcs of the inside function objects order
, low
and high
are assuming weights to be positive integers (under normwt=FALSE
and normwt= TRUE
). I agree that "scaling up is not the ultimate solution" but if you do not want the whole function changed, it is the only I am seeing.
BTW: the function also does not handle negative weights (neither gives a warning, error or provides a meaningful result). I think the best would be to drop these together with zero weights.
I agree re: negative weights. And I agree in principle about changing the default for normwt
, but it's too late to do that now.
wtd.table
nowhere assumes integer weights, as 'frequencies' are sums of weights. I suspect the problem is more related to the rules that are used for computing quantiles when the quantile being requested is a a value where there are ties in the data, or the wtd.table method is not interpolated correctly to give the requested quantile.
Any update on this?
I find the following case to have unexpected result.
>>> x <- c(1,2)
>>> weights <- c(1,9)
>>> wtd.quantile(x, weights=weights, normwt=TRUE, prob=0)
0%
2
I believe I can help fixing.
The functions are broken and the maintainer undiscerning (see thread). To me the package is dead. Use my functions instead, where I fixed the issues.
wtd.quantile<- function (x, weights = NULL, probs = c(0, 0.25, 0.5, 0.75, 1),
type = c("quantile", "(i-1)/(n-1)", "i/(n+1)", "i/n"),
na.rm = TRUE) {
# Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
normwt = FALSE
if (!length(weights)) return(quantile(x, probs = probs, na.rm = na.rm))
type <- match.arg(type)
if (any(probs < 0 | probs > 1)) stop("Probabilities must be between 0 and 1 inclusive")
nams <- paste(format(round(probs * 100, if (length(probs) >
1) 2 - log10(diff(range(probs))) else 2)), "%", sep = "")
if(na.rm & any(is.na(weights))){ ###### new
i<- is.na(weights)
x <- x[!i]
weights <- weights[!i]
}
i <- weights <= 0 # nwe: kill negative and zero weights and associated data
if (any(i)) {
x <- x[!i]
weights <- weights[!i]
}
if (type == "quantile") {
if(sum(weights) < 1000000 ) {weights<- weights*1000000/sum(weights)} ##### new
w <- wtd.table(x, weights, na.rm = na.rm, normwt = normwt,
type = "list")
x <- w$x
wts <- w$sum.of.weights
n <- sum(wts)
order <- 1 + (n - 1) * probs
low <- pmax(floor(order), 1)
high <- pmin(low + 1, n)
order <- order%%1
allq <- approx(cumsum(wts), x, xout = c(low, high), method = "constant",
f = 1, rule = 2)$y
k <- length(probs)
quantiles <- (1 - order) * allq[1:k] + order * allq[-(1:k)]
names(quantiles) <- nams
return(quantiles)
}
w <- wtd.Ecdf(x, weights, na.rm = na.rm, type = type, normwt = normwt)
structure(approx(w$ecdf, w$x, xout = probs, rule = 2)$y,
names = nams)
}
wtd.table<- function (x, weights = NULL, type = c("list", "table"), normwt = FALSE,
na.rm = TRUE) {
# Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
p_load(timeDate)
type <- match.arg(type)
if (!length(weights))
weights <- rep(1, length(x))
isdate <- ( class(x)[1]=="Date" | class(x)[1]=="POSIXct") ### MODIFIED
ax <- attributes(x)
ax$names <- NULL
if (is.character(x))
x <- as.factor(x)
lev <- levels(x)
x <- unclass(x)
if (na.rm) {
s <- !is.na(x + weights)
x <- x[s, drop = FALSE]
weights <- weights[s]
}
n <- length(x)
if (normwt)
weights <- weights * length(x)/sum(weights)
i <- order(x)
x <- x[i]
weights <- weights[i]
if (anyDuplicated(x)) {
weights <- tapply(weights, x, sum)
if (length(lev)) {
levused <- lev[sort(unique(x))]
if ((length(weights) > length(levused)) && any(is.na(weights)))
weights <- weights[!is.na(weights)]
if (length(weights) != length(levused))
stop("program logic error")
names(weights) <- levused
}
if (!length(names(weights)))
stop("program logic error")
if (type == "table")
return(weights)
#x <- all.is.numeric(names(weights), "vector")# modified: commented out. function checked whether all are numeric and if yes returned the weights
if (isdate) attributes(x) <- c(attributes(x), ax)
x_out<- as.numeric(names(weights))
names(weights) <- NULL
return(list(x = x_out, sum.of.weights = weights))
}
xx <- x
if (isdate)
attributes(xx) <- c(attributes(xx), ax)
if (type == "list")
list(x = if (length(lev)) lev[x] else xx, sum.of.weights = weights)
else {
names(weights) <- if (length(lev))
lev[x]
else xx
weights
}
}
wtd.mean <- function (x, weights = NULL, normwt = "ignored", na.rm = TRUE) {
# Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
if (!length(weights))
return(mean(x, na.rm = na.rm))
if (na.rm) {
s <- !is.na(x + weights)
x <- x[s]
weights <- weights[s]
}
sum(weights * x)/sum(weights)
}
@harrelfe Has this issue been addressed in the latest version of Hmisc? The wtd.quantile
function is pretty important to the work we do. Thanks for this package and rms
!
I'm using Hmisc to generate weighted boxplots with specific quantiles (.05, .25,.5,.75,.95), and I can't get it to to match with the output of STATA's fweight feature. Do you have any suggestions for creating a comparable function?
@Stefan2015-5 @harrelfe Hello :) any news about this?
as already stated: "The functions are broken and the maintainer undiscerning (see thread). To me the package is dead." You are welcome to use my functions instead, where I fixed the issues.