correlation
correlation copied to clipboard
weighted corr
I need to use a weight correlation because some samples represent more population than others.
I know how to do a weighted correlation matrix but not using cor_test (e.g., cor.wt {psych}, or cov.wt(data %>% select(-weight), wt=data$weight, cor=TRUE)).
Would it be possible to implement a weighted (and pairwise) correlation here?
Super appreciate if there is help or recommendations.
That'd be nice indeed. Having a "weighted" method that would call psych::wt
seems relatively easy, but the limitation is that it'd be a separate method, not compatible with other ones. I wonder whether there is a simple and straightforward way to account fo weights independently of the correlation that is run (by modifying the data or something?), so that if weights are added, it could be used with any of the subsequent correlation methods.
I remember @strengejacke we mentioned correlation weights somewhere already, what do you think of that?
I also tried to explore weightedCorr {wCorr} and try to request a pull. The formula is consistent with what I found in the literature. But I can't think at all how to calculate significance and p-value.
function (x, y, w, method = c("Pearson", "Spearman"))
{
if (!is.numeric(x)) {
x <- as.numeric(x)
}
if (!is.numeric(y)) {
y <- as.numeric(y)
}
if (!is.numeric(w)) {
w <- as.numeric(w)
}
pm <- pmatch(tolower(method[[1]]), tolower(c("Pearson",
"Spearman")))
if (pm == 2) {
x <- wrank(x, w)
y <- wrank(y, w)
}
xb <- sum(w * x)/sum(w)
yb <- sum(w * y)/sum(w)
numerator <- sum(w * (x - xb) * (y - yb))
denom <- sqrt(sum(w * (x - xb)^2) * sum(w * (y - yb)^2))
numerator/denom
}
We can start by having a new method called weighted that we can put in its own separete file cor_test_weighted
, which will be called when method = "weighted"
(need also to add a weights
argument, which is NULL
by default. @abuabara would you like trying to make a PR to add that?
@DominiqueMakowski actually I'm not sure how to do that. I was checking the code after creating a fork, I don't know where it would be the best way to include it. If it’s an extra method, it would be a weighted Pearson, correct? But if it is an weighted option, we would have to update all methods to a weighted version, which I don't have references for. I'm not an statistician but I think that multiplying x and y by the weight value would be the most efficient way to get started, still keeping the same framework.
Yes, you're absolutely right, adding it as a new method would be a bit artificial since conceptually it's an "option" that could apply to all methods. That said, "multiplying x and y by the weight value" seems a bit too easy 🤔 (we recently implemented a robust get_weights()
functio in insight so we could use that @strengejacke)
Let me ping also @IndrajeetPatil because I saw that he was interested in weighted statistics ☺️ Any idea about how to add weighted options to correlations?
The weighting syntax used in stats::cov.wt
could pretty much be imported wholesale and applied to most methods:
> cov.wt
function (x, wt = rep(1/nrow(x), nrow(x)), cor = FALSE, center = TRUE,
method = c("unbiased", "ML"))
{
if (is.data.frame(x))
x <- as.matrix(x)
else if (!is.matrix(x))
stop("'x' must be a matrix or a data frame")
if (!all(is.finite(x)))
stop("'x' must contain finite values only")
n <- nrow(x)
if (with.wt <- !missing(wt)) {
if (length(wt) != n)
stop("length of 'wt' must equal the number of rows in 'x'")
if (any(wt < 0) || (s <- sum(wt)) == 0)
stop("weights must be non-negative and not all zero")
wt <- wt/s
}
if (is.logical(center)) {
center <- if (center)
colSums(wt * x)
else 0
}
else {
if (length(center) != ncol(x))
stop("length of 'center' must equal the number of columns in 'x'")
}
x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)
cov <- switch(match.arg(method), unbiased = crossprod(x)/(1 -
sum(wt^2)), ML = crossprod(x))
y <- list(cov = cov, center = center, n.obs = n)
if (with.wt)
y$wt <- wt
if (cor) {
Is <- 1/sqrt(diag(cov))
R <- cov
R[] <- Is * cov * rep(Is, each = nrow(cov))
y$cor <- R
}
y
}
<bytecode: 0x00000256f3dd9168> <environment: namespace:stats>
(we recently implemented a robust get_weights() functio in insight so we could use that @strengejacke)
get_weights()
onyl returns weights from a model object, not sure if we could use it in this context?
I think we should add a new weights
argument that takes a variable or vector which are multiplied by the cases before computation of the various correlations