correlation icon indicating copy to clipboard operation
correlation copied to clipboard

weighted corr

Open abuabara opened this issue 4 years ago • 8 comments

I need to use a weight correlation because some samples represent more population than others.

I know how to do a weighted correlation matrix but not using cor_test (e.g., cor.wt {psych}, or cov.wt(data %>% select(-weight), wt=data$weight, cor=TRUE)).

Would it be possible to implement a weighted (and pairwise) correlation here?

Super appreciate if there is help or recommendations.

abuabara avatar Feb 28 '21 17:02 abuabara

That'd be nice indeed. Having a "weighted" method that would call psych::wt seems relatively easy, but the limitation is that it'd be a separate method, not compatible with other ones. I wonder whether there is a simple and straightforward way to account fo weights independently of the correlation that is run (by modifying the data or something?), so that if weights are added, it could be used with any of the subsequent correlation methods.

I remember @strengejacke we mentioned correlation weights somewhere already, what do you think of that?

DominiqueMakowski avatar Feb 28 '21 23:02 DominiqueMakowski

I also tried to explore weightedCorr {wCorr} and try to request a pull. The formula is consistent with what I found in the literature. But I can't think at all how to calculate significance and p-value.

function (x, y, w, method = c("Pearson", "Spearman")) 
{
   if (!is.numeric(x)) {
      x <- as.numeric(x)
   }
   if (!is.numeric(y)) {
      y <- as.numeric(y)
   }
   if (!is.numeric(w)) {
      w <- as.numeric(w)
   }
   pm <- pmatch(tolower(method[[1]]), tolower(c("Pearson", 
      "Spearman")))
   if (pm == 2) {
      x <- wrank(x, w)
      y <- wrank(y, w)
   }
   xb <- sum(w * x)/sum(w)
   yb <- sum(w * y)/sum(w)
   numerator <- sum(w * (x - xb) * (y - yb))
   denom <- sqrt(sum(w * (x - xb)^2) * sum(w * (y - yb)^2))
   numerator/denom
}

abuabara avatar Mar 01 '21 02:03 abuabara

We can start by having a new method called weighted that we can put in its own separete file cor_test_weighted, which will be called when method = "weighted" (need also to add a weights argument, which is NULL by default. @abuabara would you like trying to make a PR to add that?

DominiqueMakowski avatar Mar 01 '21 04:03 DominiqueMakowski

@DominiqueMakowski actually I'm not sure how to do that. I was checking the code after creating a fork, I don't know where it would be the best way to include it. If it’s an extra method, it would be a weighted Pearson, correct? But if it is an weighted option, we would have to update all methods to a weighted version, which I don't have references for. I'm not an statistician but I think that multiplying x and y by the weight value would be the most efficient way to get started, still keeping the same framework.

abuabara avatar Mar 08 '21 17:03 abuabara

Yes, you're absolutely right, adding it as a new method would be a bit artificial since conceptually it's an "option" that could apply to all methods. That said, "multiplying x and y by the weight value" seems a bit too easy 🤔 (we recently implemented a robust get_weights() functio in insight so we could use that @strengejacke)

Let me ping also @IndrajeetPatil because I saw that he was interested in weighted statistics ☺️ Any idea about how to add weighted options to correlations?

DominiqueMakowski avatar Mar 09 '21 01:03 DominiqueMakowski

The weighting syntax used in stats::cov.wt could pretty much be imported wholesale and applied to most methods:

> cov.wt
function (x, wt = rep(1/nrow(x), nrow(x)), cor = FALSE, center = TRUE, 
    method = c("unbiased", "ML")) 
{
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    else if (!is.matrix(x)) 
        stop("'x' must be a matrix or a data frame")
    if (!all(is.finite(x))) 
        stop("'x' must contain finite values only")
    n <- nrow(x)
    if (with.wt <- !missing(wt)) {
        if (length(wt) != n) 
            stop("length of 'wt' must equal the number of rows in 'x'")
        if (any(wt < 0) || (s <- sum(wt)) == 0) 
            stop("weights must be non-negative and not all zero")
        wt <- wt/s
    }
    if (is.logical(center)) {
        center <- if (center) 
            colSums(wt * x)
        else 0
    }
    else {
        if (length(center) != ncol(x)) 
            stop("length of 'center' must equal the number of columns in 'x'")
    }
    x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)
    cov <- switch(match.arg(method), unbiased = crossprod(x)/(1 - 
        sum(wt^2)), ML = crossprod(x))
    y <- list(cov = cov, center = center, n.obs = n)
    if (with.wt) 
        y$wt <- wt
    if (cor) {
        Is <- 1/sqrt(diag(cov))
        R <- cov
        R[] <- Is * cov * rep(Is, each = nrow(cov))
        y$cor <- R
    }
    y
}

<bytecode: 0x00000256f3dd9168> <environment: namespace:stats>

bwiernik avatar Apr 24 '21 13:04 bwiernik

(we recently implemented a robust get_weights() functio in insight so we could use that @strengejacke)

get_weights() onyl returns weights from a model object, not sure if we could use it in this context?

strengejacke avatar Apr 27 '21 06:04 strengejacke

I think we should add a new weights argument that takes a variable or vector which are multiplied by the cases before computation of the various correlations

bwiernik avatar Apr 27 '21 09:04 bwiernik