
partial r hypothesis testing does not account for uncertainty in residualizing

Open mattansb opened this issue 4 years ago • 18 comments

Note the following examples:

res <- correlation::correlation(mtcars, partial = TRUE)
res[res$Parameter1=="mpg",]
#> Parameter1 | Parameter2 |     r |     t | df |     p |         95% CI |  Method | n_Obs
#> ---------------------------------------------------------------------------------------
#> mpg        |        cyl | -0.02 | -0.13 | 30 | 1.000 | [-0.37,  0.33] | Pearson |    32
#> mpg        |       disp |  0.16 |  0.89 | 30 | 1.000 | [-0.20,  0.48] | Pearson |    32
#> mpg        |         hp | -0.21 | -1.18 | 30 | 1.000 | [-0.52,  0.15] | Pearson |    32
#> mpg        |       drat |  0.10 |  0.58 | 30 | 1.000 | [-0.25,  0.44] | Pearson |    32
#> mpg        |         wt | -0.39 | -2.34 | 30 | 1.000 | [-0.65, -0.05] | Pearson |    32
#> mpg        |       qsec |  0.24 |  1.34 | 30 | 1.000 | [-0.12,  0.54] | Pearson |    32
#> mpg        |         vs |  0.03 |  0.18 | 30 | 1.000 | [-0.32,  0.38] | Pearson |    32
#> mpg        |         am |  0.26 |  1.46 | 30 | 1.000 | [-0.10,  0.56] | Pearson |    32
#> mpg        |       gear |  0.10 |  0.52 | 30 | 1.000 | [-0.26,  0.43] | Pearson |    32
#> mpg        |       carb | -0.05 | -0.29 | 30 | 1.000 | [-0.39,  0.30] | Pearson |    32

res <- ppcor::pcor(mtcars)
data.frame(r = res$estimate[-1,1],
           t = res$statistic[-1,1],
           p = res$p.value[-1,1])
#>                r          t          p
#> cyl  -0.02326429 -0.1066392 0.91608738
#> disp  0.16083460  0.7467585 0.46348865
#> hp   -0.21052027 -0.9868407 0.33495531
#> drat  0.10445452  0.4813036 0.63527790
#> wt   -0.39344938 -1.9611887 0.06325215
#> qsec  0.23809863  1.1234133 0.27394127
#> vs    0.03293117  0.1509915 0.88142347
#> am    0.25832849  1.2254035 0.23398971
#> gear  0.09534261  0.4389142 0.66520643
#> carb -0.05243662 -0.2406258 0.81217871

Created on 2020-04-06 by the reprex package (v0.3.0)

The resulting partial correlations are identical, but the t values are not - and, by extension, neither are the CIs or the unadjusted p values. Why?

Because correlation() computes partial correlations by residualizing the variables and then correlating the residuals. But the df of the residualizing process - that is, the uncertainty in estimating the residuals - is not accounted for. (Note that this should be true for Bayesian partial correlations as well: the priors and likelihood of the residualizing process are not accounted for.)
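
For concreteness, here is a minimal sketch (my illustration, not the package's internals) of the residualize-then-correlate approach for the mpg-wt pair, showing where the df go missing:

e_mpg <- residuals(lm(mpg ~ . - wt, data = mtcars))  # mpg residualized on the other 9 variables
e_wt  <- residuals(lm(wt ~ . - mpg, data = mtcars))  # wt residualized on the same 9

# cor.test() only sees two length-32 vectors, so it uses df = 32 - 2 = 30,
# unaware that 9 covariates were already "spent" in the residualizing step:
cor.test(e_mpg, e_wt)  # r = -0.39, t = -2.34, df = 30 - as in the output above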

Solutions:

  • Account for these. [HARD]
  • Update the docs to explicitly mention this - that inference and CIs are conditional on, and do not account for, the uncertainty in estimating the residuals. [EASY]

mattansb avatar Apr 06 '20 05:04 mattansb

res <- correlation::correlation(mtcars, partial = TRUE, p_adjust = "none")
as.data.frame(res[res$Parameter1=="mpg", c("Parameter2", "r", "t", "p")])
#>    Parameter2           r          t          p
#> 1         cyl -0.02326429 -0.1274582 0.89942825
#> 2        disp  0.16083460  0.8925471 0.37920361
#> 3          hp -0.21052027 -1.1795002 0.24746855
#> 4        drat  0.10445452  0.5752679 0.56940012
#> 5          wt -0.39344938 -2.3440688 0.02589012
#> 6        qsec  0.23809863  1.3427357 0.18942961
#> 7          vs  0.03293117  0.1804693 0.85799776
#> 8          am  0.25832849  1.4646374 0.15342166
#> 9        gear  0.09534261  0.5246028 0.60371458
#> 10       carb -0.05243662 -0.2876029 0.77562818

res <- ppcor::pcor(mtcars)
data.frame(r = res$estimate[-1,1],
           t = res$statistic[-1,1],
           p = res$p.value[-1,1])
#>                r          t          p
#> cyl  -0.02326429 -0.1066392 0.91608738
#> disp  0.16083460  0.7467585 0.46348865
#> hp   -0.21052027 -0.9868407 0.33495531
#> drat  0.10445452  0.4813036 0.63527790
#> wt   -0.39344938 -1.9611887 0.06325215
#> qsec  0.23809863  1.1234133 0.27394127
#> vs    0.03293117  0.1509915 0.88142347
#> am    0.25832849  1.2254035 0.23398971
#> gear  0.09534261  0.4389142 0.66520643
#> carb -0.05243662 -0.2406258 0.81217871

Created on 2020-04-06 by the reprex package (v0.3.0)

Good point. Should we recompute the t a posteriori based on the partial r?

Otherwise, if that's too complicated, we should just remove them from the output, as it doesn't make sense to include indices that are "false"?

DominiqueMakowski avatar Apr 06 '20 06:04 DominiqueMakowski

Should we recompute the t a posteriori based on the partial r?

How would you do this?

Otherwise, if that's too complicated we should just remove them from the output as it doesn't make sense to add in indices that are "false"?

I think if we cannot provide the correct stats, this would be best for the frequentist r.

However, for the Bayesian r, I'm not sure what to do, as the whole posterior distribution is biased in some way... so any index (pd, BF, pROPE) would be biased as well...

mattansb avatar Apr 06 '20 06:04 mattansb

How would you do this

effectsize::r_to_t? We have the r and the df, so in theory we should be able to retrieve the t?

But wait, hold on - now that I think of it, our partial correlations are not computed by just transforming matrices, but by actually computing new correlations on the partialized data. So in fact, shouldn't the t and everything be computed correctly? I.e., we do not compute the partial r from the raw r (as we could through correlation::cor_to_pcor). So I wonder whether it is not ppcor that has the issue?
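
(As a hedged aside - my snippet, assuming cor_to_pcor() accepts a correlation matrix as its first argument - the matrix route does give the same point estimates:)

pc <- correlation::cor_to_pcor(cor(mtcars))
pc["mpg", "wt"]
#> [1] -0.3934494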

DominiqueMakowski avatar Apr 06 '20 06:04 DominiqueMakowski

But you don't have the correct df - you have the df of a simple correlation between two vectors. The function doesn't know whether the vectors are raw scores, transformed scores, or residualized scores.

Which is exactly what you're saying here:

our partial correlations are not just computed by transforming matrices, but by really computing the new correlations based on data partialization.

Which is why the t and everything aren't computed correctly. ppcor does take all of this into account, and so produces the correct t etc., which should be the same as those produced by lm() (the frequentist test for a partial correlation is equivalent to the frequentist test for the corresponding slope in a multiple regression):

res <- ppcor::pcor(mtcars)
data.frame(t = res$statistic[-1,1],
           p = res$p.value[-1,1])
#>               t          p
#> cyl  -0.1066392 0.91608738
#> disp  0.7467585 0.46348865
#> hp   -0.9868407 0.33495531
#> drat  0.4813036 0.63527790
#> wt   -1.9611887 0.06325215
#> qsec  1.1234133 0.27394127
#> vs    0.1509915 0.88142347
#> am    1.2254035 0.23398971
#> gear  0.4389142 0.66520643
#> carb -0.2406258 0.81217871

summary(lm(mpg ~ ., mtcars))$coef[-1,c(3,4)]
#>         t value   Pr(>|t|)
#> cyl  -0.1066392 0.91608738
#> disp  0.7467585 0.46348865
#> hp   -0.9868407 0.33495531
#> drat  0.4813036 0.63527790
#> wt   -1.9611887 0.06325215
#> qsec  1.1234133 0.27394127
#> vs    0.1509915 0.88142347
#> am    1.2254035 0.23398971
#> gear  0.4389142 0.66520643
#> carb -0.2406258 0.81217871

Created on 2020-04-06 by the reprex package (v0.3.0)

mattansb avatar Apr 06 '20 06:04 mattansb

All of this is also true for cor_to_ci / cor_to_p - their results are only valid for simple correlations.

This is also somewhat true for correlation(..., multilevel = TRUE) (though I've seen multilevel correlations tested without any concern for this issue).
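
To illustrate the first point in base R, using the numbers from the mtcars example above:

r <- -0.39344938  # partial r of mpg and wt from the examples above
n <- 32
k <- 9            # number of partialed-out covariates

# inference assuming df = n - 2 (simple correlation) vs. df = n - 2 - k (partial)
t_simple  <- r * sqrt((n - 2) / (1 - r^2))       # -2.34
t_partial <- r * sqrt((n - 2 - k) / (1 - r^2))   # -1.96

2 * pt(abs(t_simple),  df = n - 2,     lower.tail = FALSE)  # 0.0259 (correlation's p)
2 * pt(abs(t_partial), df = n - 2 - k, lower.tail = FALSE)  # 0.0633 (ppcor's p)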

mattansb avatar Apr 06 '20 07:04 mattansb

But you don't have the correct df

You're right, mmmh, that's tricky due to the way partial correlations are created. We could maybe retrieve the right df by counting how many variables in the data underwent partialization?

DominiqueMakowski avatar Apr 06 '20 07:04 DominiqueMakowski

Hmmm... that would work!

(Recall that effectsize::r_to_t was removed a while back... So you'd have to build it as an internal function here.)

res <- ppcor::pcor(mtcars)

r_to_t <- function(r, df_error) {
  # back out the t statistic from a correlation and its error df
  r * sqrt(df_error / (1 - r ^ 2))
}

# error df = n minus the number of variables in the correlation matrix
df <- res$n - nrow(res$estimate)

r_to_t(res$estimate[-1,1], df)
#>        cyl       disp         hp       drat         wt       qsec         vs 
#> -0.1066392  0.7467585 -0.9868407  0.4813036 -1.9611887  1.1234133  0.1509915 
#>         am       gear       carb 
#>  1.2254035  0.4389142 -0.2406258
res$statistic[-1,1]
#>        cyl       disp         hp       drat         wt       qsec         vs 
#> -0.1066392  0.7467585 -0.9868407  0.4813036 -1.9611887  1.1234133  0.1509915 
#>         am       gear       carb 
#>  1.2254035  0.4389142 -0.2406258

Created on 2020-04-06 by the reprex package (v0.3.0)

mattansb avatar Apr 06 '20 07:04 mattansb

Continuing https://github.com/easystats/correlation/commit/464544d3d25f585ff9724795dea06058ba429cac here:

With r and df you can build back the t value, the p value, and the CI - so essentially I think you wouldn't even need to use stats::cor.test:

  • Use cor to get the r.
  • Then with n and n_param, get the df ( = n - n_param - 1).
  • Then with r and df, get the t ( = r * sqrt(df_error / (1 - r ^ 2))).
  • Then with t and df:
      • get the p value ( = 2 * stats::pt(abs(t), df, lower.tail = FALSE) ),
      • get the CI with effectsize::t_to_r(t, df).

This won't work for Kendall's Tau..., but it will for Pearson and Spearman. Is partial = TRUE supported by any of the other methods? Not sure if / how it is solvable for them....
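
Putting the steps together, a rough sketch (ad hoc names - pcor_inference is hypothetical, not package code; for simplicity the CI below uses a Fisher-z interval rather than effectsize::t_to_r()):

pcor_inference <- function(r, n, n_param, ci = 0.95) {
  df_error <- n - n_param - 1
  t <- r * sqrt(df_error / (1 - r^2))
  p <- 2 * stats::pt(abs(t), df_error, lower.tail = FALSE)
  # Fisher-z CI; the SE 1 / sqrt(df_error - 1) reduces to the usual
  # 1 / sqrt(n - 3) when nothing is partialed out
  z  <- atanh(r)
  se <- 1 / sqrt(df_error - 1)
  q  <- stats::qnorm(1 - (1 - ci) / 2)
  data.frame(r = r, t = t, df = df_error, p = p,
             CI_low = tanh(z - q * se), CI_high = tanh(z + q * se))
}

# mpg-wt in mtcars, partialing out the 9 other variables
# (n_param = 10, the number of predictors in lm(mpg ~ ., mtcars)):
pcor_inference(-0.39344938, n = 32, n_param = 10)
# gives t = -1.96, df = 21, p = 0.063 - matching ppcor above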

mattansb avatar Apr 06 '20 08:04 mattansb

Is patrial = TRUE supported by any of the other methods? Not sure if / how it is solvable for them....

https://github.com/easystats/correlation/blob/99ef225a16c8e803a59546fe41fe28b22ce5b670/R/cor_test.R#L78-L81

Since partialization pretty much takes place here, it's "compatible" with all methods...

But we can start by re-writing .cor_test_freq so that it takes n_param as an argument and removes the call to cor.test.

DominiqueMakowski avatar Apr 06 '20 08:04 DominiqueMakowski

Hmmmm... I think it would perhaps be better not to support this at all - and to thoroughly document that partialling is done by residualizing, and exactly how the residualizing is done - than to partially support it...

I mean, residualizing is a thing people do, without accounting for the df of the residualizing process...

I don't know...

mattansb avatar Apr 06 '20 08:04 mattansb

This is also true... I mean, here the df, t, etc. are not wrong per se; it's more that they treat the partialization as an independent step, and the data obtained from it as the data... (which isn't completely wrong, depending on how you conceptualize it)

So yeah, it's quite tricky 😕

DominiqueMakowski avatar Apr 06 '20 09:04 DominiqueMakowski

True... any thoughts @IndrajeetPatil @strengejacke ?

mattansb avatar Apr 06 '20 10:04 mattansb

Is my explanation okay?

DominiqueMakowski avatar Apr 08 '20 10:04 DominiqueMakowski

any thoughts

No. My only understanding of correlation is: the higher, the higher - or the higher, the lower.

strengejacke avatar Apr 08 '20 11:04 strengejacke

💥😲💥

DominiqueMakowski avatar Apr 08 '20 11:04 DominiqueMakowski

@DominiqueMakowski I would change

This is why ~small~ some discrepancies are to be expected...

As this can be quite a large discrepancy, depending on:

  1. How many covariates are partialled out.
  2. The strength of the correlation between X, Y and the partialled out covariates (the tolerance).

mattansb avatar Apr 11 '20 09:04 mattansb

The df are wrong here. When partial = TRUE, the df need to be reduced by the number of partialed variables. Compare the psych and ppcor results:

cr <- correlation::correlation(iris[,1:4], partial = TRUE, p_adjust = "none")
crs <- summary(cr, redundant = TRUE)

pp <- ppcor::pcor(iris[,1:4])

# 2 is number of partialed variables (pp$gp)
ps <- psych::corr.p(psych::partial.r(iris[,1:4]), 150 - 2, adjust = "none")

# r
as.matrix(cr)
pp$estimate
ps$r

# t
attributes(crs)$t
pp$statistic
ps$t

# p
attributes(crs)$p
pp$p
ps$p

bwiernik avatar Apr 24 '21 14:04 bwiernik

I mean, residualizing is a thing people do, without accounting for the df of the residualizing process... I don't know...

This might be common, but it's not correct. The distribution of r / z′ is pretty well captured by df = N - 2 - np (np = number of variables partialed out). This applies well to Pearson, Spearman, and Kendall (though they each have their own SE formula). Not sure about other methods.
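
A quick simulation sketch along these lines (my own illustration, not from the thread): under the null, t statistics computed from residualized vectors with df = N - 2 - np track the t(N - 2 - np) distribution:

set.seed(1)
N <- 20; np <- 9  # sample size; number of partialed-out covariates

t_vals <- replicate(5000, {
  d  <- as.data.frame(matrix(rnorm(N * (np + 2)), N))  # V1, V2 + 9 null covariates
  ex <- residuals(lm(V1 ~ . - V2, data = d))
  ey <- residuals(lm(V2 ~ . - V1, data = d))
  r  <- cor(ex, ey)
  r * sqrt((N - 2 - np) / (1 - r^2))
})

# empirical quantiles are close to t(9) (about +/- 2.26), not t(18) (about +/- 2.10):
round(quantile(t_vals, c(.025, .5, .975)), 2)
round(qt(c(.025, .5, .975), df = N - 2 - np), 2)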

bwiernik avatar Apr 24 '21 14:04 bwiernik