Handling of factors with `include_factors`
Hello, many thanks for your easystats project!
In the latest version of `correlation` on CRAN, I encounter unexpected behavior when including factors:
mtcars$am <- factor(mtcars$am)
correlation::correlation(mtcars[, c("am", "mpg")], include_factors=TRUE)
Parameter1 | Parameter2 | r | 95% CI | t | df | p | Method | n_Obs
----------------------------------------------------------------------------------------
am.0 | am.1 | -1.00 | [-1.00, -1.00] | -Inf | 30 | < .001 | Pearson | 32
am.0 | mpg | -0.60 | [-0.78, -0.32] | -4.11 | 30 | < .001 | Pearson | 32
am.1 | mpg | 0.60 | [ 0.32, 0.78] | 4.11 | 30 | < .001 | Pearson | 32
I was expecting a biserial correlation of `am` with `mpg`, like here:
mtcars$am <- as.numeric(factor(mtcars$am))
correlation::correlation(mtcars[, c("am", "mpg")], include_factors=TRUE)
Parameter1 | Parameter2 | r | 95% CI | t | df | p | Method | n_Obs
------------------------------------------------------------------------------------
am | mpg | 0.60 | [0.32, 0.78] | 4.11 | 30 | < .001 | Pearson | 32
Am I doing anything wrong here? Is this behavior expected?
Many thanks for your help!
Hi @mutlusun
You were probably expecting a point-biserial correlation, which is pretty much a Pearson correlation. This is what you obtained in your second result (because `am` has only 2 levels, 0 and 1, a Pearson correlation with this variable is de facto a point-biserial correlation).
You can observe that the results of your second analysis are identical to those of the first. What happened when using `include_factors` is that `am` got split into 2 separate dummy variables, `am.0` and `am.1`. In your case, because there are only two levels, they are redundant. Hence the pattern that you observe.
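For illustration, this split is essentially what base R's `model.matrix()` produces when a factor is expanded into indicator columns (a minimal sketch, not the package's internal code):

```r
# Minimal sketch of the dummy expansion behind include_factors = TRUE
# (illustrative only; correlation() handles this internally)
df <- mtcars
df$am <- factor(df$am)

# Expand the factor into one 0/1 indicator column per level
dummies <- model.matrix(~ am - 1, data = df)
colnames(dummies)  # "am0" "am1"

# With only two levels, the indicators are perfectly redundant:
cor(dummies[, "am0"], dummies[, "am1"])  # -1
```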
If you want a true biserial correlation (which is a bit different from a point-biserial one), you'll need to specify the `method` argument:
library(correlation)
df <- mtcars
df$am <- as.factor(df$am)
cor_test(df, "mpg", "am", method="biserial")
Parameter1 | Parameter2 | rho | 95% CI | t | df | p | Method | n_Obs
-------------------------------------------------------------------------------------
mpg | am | 0.75 | [0.54, 0.87] | 6.06 | 30 | < .001 | Biserial | 32
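As a sanity check, the biserial coefficient can be recovered from the point-biserial one under the latent-normality assumption; a back-of-the-envelope sketch (the conversion formula is standard, the variable names here are mine):

```r
# Back-of-the-envelope check: biserial from point-biserial,
# assuming a latent normal variable behind the dichotomy
r_pb <- cor(mtcars$mpg, mtcars$am)  # point-biserial (Pearson on 0/1)
p    <- mean(mtcars$am == 1)        # proportion of 1s
q    <- 1 - p
h    <- dnorm(qnorm(p))             # normal density at the dichotomization threshold
r_pb * sqrt(p * q) / h              # ~0.76, in line with the 0.75 above
# (small differences can come from the exact estimator used by cor_test)
```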
Hope it helps!
Hi @DominiqueMakowski,
thanks for your fast reply and your clarifications! I agree with you that the results are correct. My issue was more about the output style.
In the case of the `correlation()` function, it would be nice to convert factors automatically to numerics (maybe an additional argument would be feasible?) and to choose a point-biserial correlation if there are only two levels.
However, that's only a suggestion and not a real issue. So if you want to keep the function as it is, we can close this issue.
So you mean essentially setting `include_factors = TRUE` by default.
I'm not 100% convinced yet, as I'd tend to think that most users would expect factors to be omitted by default, since correlations between factors are not really intuitive for the majority of people (?)
Let's leave that open for now, in case others want to give their input ☺️
Sorry for my short reply. I did not intend to set `include_factors` to `TRUE` by default.
I would propose the following defaults:
df <- mtcars[, c("am", "cyl", "mpg", "disp")]
df$am <- factor(df$am) # has two levels
df$cyl <- factor(df$cyl) # has 3 levels
correlation::correlation(df, include_factors=FALSE, convert_factors_to_numeric=TRUE)
Parameter1 | Parameter2 | r | 95% CI | t | df | p | Method | n_Obs
----------------------------------------------------------------------------------------
mpg | disp | -0.85 | [-0.92, -0.71] | -8.75 | 30 | < .001 | Pearson | 32
However, setting `include_factors` to `TRUE` returns the following result:
correlation::correlation(df, include_factors=TRUE, convert_factors_to_numeric=TRUE)
Parameter1 | Parameter2 | r | 95% CI | t | df | p | Method | n_Obs
----------------------------------------------------------------------------------------
mpg | disp | -0.85 | [-0.92, -0.71] | -8.75 | 30 | < .001 | Pearson | 32
mpg | am | 0.75 | [0.54, 0.87] | 6.06 | 30 | < .001 | Biserial | 32
cyl | am | -0.52 | [-0.74, -0.21] | -3.36 | 30 | 0.002 | Point-biserial | 32
cyl | mpg | -0.85 | [-0.93, -0.72] | -8.92 | 30 | < .001 | Pearson | 32
cyl | disp | 0.90 | [0.81, 0.95] | 11.45 | 30 | < .001 | Pearson | 32
Instead of the correlation coefficients above, a correlation coefficient for ordinal variables may be more appropriate (like Spearman's rho or Kendall's tau). Maybe it's more reasonable to treat factors as ordinal than as continuous. I think this output is a bit easier to interpret than the current output.
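As an aside, something close to this can already be approximated today by converting the factors manually and using the existing `method` argument (a sketch; `method = "spearman"` is a documented option of `correlation()`):

```r
# Sketch: treat the factors as ordinal by converting them to numeric codes,
# then ask for a rank-based coefficient via the existing method argument
df2 <- mtcars[, c("am", "cyl", "mpg", "disp")]
df2$am  <- as.numeric(factor(df2$am))
df2$cyl <- as.numeric(factor(df2$cyl))
correlation::correlation(df2, method = "spearman")
```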
Setting `convert_factors_to_numeric` to `FALSE` would return the original result:
correlation::correlation(df, include_factors=TRUE, convert_factors_to_numeric=FALSE)
Parameter1 | Parameter2 | r | 95% CI | t | df | p | Method | n_Obs
----------------------------------------------------------------------------------------
am.0 | am.1 | -1.00 | [-1.00, -1.00] | -Inf | 30 | < .001 | Pearson | 32
am.0 | cyl.4 | -0.47 | [-0.71, -0.15] | -2.94 | 30 | 0.062 | Pearson | 32
am.0 | cyl.6 | -0.02 | [-0.37, 0.33] | -0.13 | 30 | 1.000 | Pearson | 32
am.0 | cyl.8 | 0.47 | [ 0.15, 0.71] | 2.94 | 30 | 0.062 | Pearson | 32
am.0 | mpg | -0.60 | [-0.78, -0.32] | -4.11 | 30 | 0.004 | Pearson | 32
am.0 | disp | 0.59 | [ 0.31, 0.78] | 4.02 | 30 | 0.004 | Pearson | 32
am.1 | cyl.4 | 0.47 | [ 0.15, 0.71] | 2.94 | 30 | 0.062 | Pearson | 32
am.1 | cyl.6 | 0.02 | [-0.33, 0.37] | 0.13 | 30 | 1.000 | Pearson | 32
am.1 | cyl.8 | -0.47 | [-0.71, -0.15] | -2.94 | 30 | 0.062 | Pearson | 32
am.1 | mpg | 0.60 | [ 0.32, 0.78] | 4.11 | 30 | 0.004 | Pearson | 32
am.1 | disp | -0.59 | [-0.78, -0.31] | -4.02 | 30 | 0.004 | Pearson | 32
cyl.4 | cyl.6 | -0.38 | [-0.65, -0.04] | -2.27 | 30 | 0.153 | Pearson | 32
cyl.4 | cyl.8 | -0.64 | [-0.81, -0.37] | -4.54 | 30 | 0.001 | Pearson | 32
cyl.4 | mpg | 0.80 | [ 0.63, 0.90] | 7.35 | 30 | < .001 | Pearson | 32
cyl.4 | disp | -0.75 | [-0.87, -0.54] | -6.12 | 30 | < .001 | Pearson | 32
cyl.6 | cyl.8 | -0.47 | [-0.70, -0.14] | -2.89 | 30 | 0.062 | Pearson | 32
cyl.6 | mpg | -0.03 | [-0.38, 0.32] | -0.17 | 30 | 1.000 | Pearson | 32
cyl.6 | disp | -0.21 | [-0.52, 0.15] | -1.15 | 30 | 1.000 | Pearson | 32
cyl.8 | mpg | -0.74 | [-0.87, -0.53] | -6.06 | 30 | < .001 | Pearson | 32
cyl.8 | disp | 0.88 | [ 0.78, 0.94] | 10.40 | 30 | < .001 | Pearson | 32
mpg | disp | -0.85 | [-0.92, -0.71] | -8.75 | 30 | < .001 | Pearson | 32
What do you think about it?
Thanks for your thoughts!
I think the current behavior is good. There are four things that can be done with factors when computing correlations:
1. Omit them and/or error (the default behavior, what `stats::cor()` does).
2. Somehow infer a relationship to a continuous variable, such as computing a polyserial or polychoric correlation, or even just converting to integers and pretending it's a metric variable.
3. Dummy code the levels and compute correlations with the dummy variables.
4. Compute a nominal correlation statistic like Cramér's V.
Any of these options is a reasonable choice, but (2)-(4) each involve some important assumptions and might be preferred in different situations. That makes (1) a good default.
(2) might be appropriate if the assumptions about the latent variable distribution hold. These coefficients are available by specifying the corresponding `method` argument.
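For example (a sketch; `method = "polychoric"` is among the documented `method` options and relies on the psych package being installed):

```r
library(correlation)

df <- mtcars
df$cyl  <- factor(df$cyl)   # ordinal: 4, 6, 8
df$gear <- factor(df$gear)  # ordinal: 3, 4, 5

# Polychoric correlation between two ordinal factors,
# assuming a latent bivariate normal behind the categories
cor_test(df, "cyl", "gear", method = "polychoric")
```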
(3) would be preferred if the variable is included as a dummy-coded variable in a regression model. Indeed, `correlation()` is the only R function I know of that makes constructing a correlation matrix with dummy-coded variables straightforward.
(4) might be preferred if the correlation table is intended as the sole summary of variable relationships, though I personally think such statistics lose too much nuance in the data.
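For completeness, a sketch of option (4); `cramers_v()` from the easystats effectsize package is one way to get such a statistic (assuming effectsize is installed):

```r
# Nominal association between two factors via Cramér's V
library(effectsize)

cramers_v(factor(mtcars$am), factor(mtcars$cyl))
```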