
Handling of factors with `include_factors`

mutlusun opened this issue 4 years ago • 5 comments

Hello, many thanks for your easystats project!

In the latest version of correlation on CRAN, I encounter an unexpected behavior when including factors:

mtcars$am <- factor(mtcars$am)
correlation::correlation(mtcars[, c("am", "mpg")], include_factors=TRUE)

Parameter1 | Parameter2 |     r |         95% CI |     t | df |      p |  Method | n_Obs
----------------------------------------------------------------------------------------
am.0       |       am.1 | -1.00 | [-1.00, -1.00] |  -Inf | 30 | < .001 | Pearson |    32
am.0       |        mpg | -0.60 | [-0.78, -0.32] | -4.11 | 30 | < .001 | Pearson |    32
am.1       |        mpg |  0.60 | [ 0.32,  0.78] |  4.11 | 30 | < .001 | Pearson |    32

I was expecting a biserial correlation of am with mpg, like here:

mtcars$am <- as.numeric(factor(mtcars$am))
correlation::correlation(mtcars[, c("am", "mpg")], include_factors=TRUE)

Parameter1 | Parameter2 |    r |       95% CI |    t | df |      p |  Method | n_Obs
------------------------------------------------------------------------------------
am         |        mpg | 0.60 | [0.32, 0.78] | 4.11 | 30 | < .001 | Pearson |    32

Am I doing anything wrong here? Is this behavior expected?

Many thanks for your help!

mutlusun avatar May 18 '20 08:05 mutlusun

Hi @mutlusun

You were probably expecting a point-biserial correlation, which is mathematically equivalent to a Pearson correlation in which one variable is dichotomous. This is what you obtained in your second result (because am has only 2 levels, 0 and 1, a Pearson correlation with this variable is de facto a point-biserial correlation).
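
To see the equivalence with base R (a minimal sketch, assuming a fresh mtcars where am is still the raw numeric 0/1 column):

# Pearson's r with a 0/1 variable *is* the point-biserial r
cor(mtcars$mpg, mtcars$am)  # ~0.60, matching the am.1 / mpg row above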

You can observe that the result of your second analysis is identical to the am.1 / mpg row of the first (r = 0.60). What happened when using include_factors is that am got split into 2 separate indicator variables, am.0 and am.1 (dummy variables). In your case, because there are only two levels, they are perfectly redundant (am.0 = 1 - am.1). Hence the pattern that you observe.
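
A rough sketch of what that splitting amounts to (this mirrors the printed behavior, not necessarily the package's exact internals):

df <- mtcars
am.0 <- as.numeric(df$am == 0)  # indicator for level 0
am.1 <- as.numeric(df$am == 1)  # indicator for level 1
cor(am.0, am.1)    # -1.00: the two indicators are perfectly redundant
cor(am.1, df$mpg)  #  0.60: matches the am.1 / mpg row
cor(am.0, df$mpg)  # -0.60: its mirror image, the am.0 / mpg row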

If you want a true biserial correlation (which is a bit different from a point-biserial: it assumes the dichotomous variable reflects an underlying continuous, normally distributed variable), you'll need to specify the method:

library(correlation)

df <- mtcars
df$am <- as.factor(df$am)

cor_test(df, "mpg", "am", method="biserial")

Parameter1 | Parameter2 |  rho |       95% CI |    t | df |      p |   Method | n_Obs
-------------------------------------------------------------------------------------
mpg        |         am | 0.75 | [0.54, 0.87] | 6.06 | 30 | < .001 | Biserial |    32

Hope it helps!

DominiqueMakowski avatar May 18 '20 12:05 DominiqueMakowski

Hi @DominiqueMakowski, thanks for your fast reply and your clarifications! I agree with you that the results are correct. My issue was more about the output style. For the correlation() function, it would be nice to convert factors to numerics automatically (maybe an additional argument would be feasible?) and to choose a point-biserial correlation when there are only two levels.

However, that's only a suggestion and not a real issue. So if you want to keep the function as it is, we can close this issue.

mutlusun avatar May 21 '20 08:05 mutlusun

So you mean essentially setting include_factors to TRUE by default.

I'm not 100% convinced yet, as I'd tend to think that most users would expect factors to be omitted by default, since correlations between factors are not really intuitive for the majority of people (?)

Let's leave that open for now, in case others want to give their input ☺️

DominiqueMakowski avatar May 21 '20 09:05 DominiqueMakowski

Sorry for my short reply. I did not intend to suggest setting include_factors to TRUE by default. I would propose the following defaults:

df <- mtcars[, c("am", "cyl", "mpg", "disp")]

df$am <- factor(df$am) # has two levels
df$cyl <- factor(df$cyl) # has 3 levels

correlation::correlation(df, include_factors=FALSE, convert_factors_to_numeric=TRUE)

Parameter1 | Parameter2 |     r |         95% CI |     t | df |      p |  Method | n_Obs
----------------------------------------------------------------------------------------
mpg        |       disp | -0.85 | [-0.92, -0.71] | -8.75 | 30 | < .001 | Pearson |    32

However, setting include_factors to true returns the following result:

correlation::correlation(df, include_factors=TRUE, convert_factors_to_numeric=TRUE)

Parameter1 | Parameter2 |     r |         95% CI |     t | df |      p |         Method | n_Obs
-----------------------------------------------------------------------------------------------
mpg        |       disp | -0.85 | [-0.92, -0.71] | -8.75 | 30 | < .001 |        Pearson |    32
mpg        |         am |  0.75 | [ 0.54,  0.87] |  6.06 | 30 | < .001 |       Biserial |    32
cyl        |         am | -0.52 | [-0.74, -0.21] | -3.36 | 30 | 0.002  | Point-biserial |    32
cyl        |        mpg | -0.85 | [-0.93, -0.72] | -8.92 | 30 | < .001 |        Pearson |    32
cyl        |       disp |  0.90 | [ 0.81,  0.95] | 11.45 | 30 | < .001 |        Pearson |    32

Instead of the correlation coefficients above, a correlation coefficient for ordinal variables might be more appropriate (like Spearman's rho or Kendall's tau). Maybe it's more reasonable to treat factors as ordinal than as continuous. I think this output is a bit easier to interpret than the current output.
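
For example, treating the integer level codes as ranks already works with the existing method argument (a sketch, assuming the level order 4 < 6 < 8 is meaningful):

df$cyl_rank <- as.numeric(df$cyl)  # integer level codes, preserving the order 4 < 6 < 8
cor(df$cyl_rank, df$mpg, method = "spearman")  # base-R rank correlation
correlation::correlation(data.frame(cyl = df$cyl_rank, mpg = df$mpg), method = "spearman")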

Setting convert_factors_to_numeric to FALSE would return the original result:

correlation::correlation(df, include_factors=TRUE, convert_factors_to_numeric=FALSE)

Parameter1 | Parameter2 |     r |         95% CI |     t | df |      p |  Method | n_Obs
----------------------------------------------------------------------------------------
am.0       |       am.1 | -1.00 | [-1.00, -1.00] |  -Inf | 30 | < .001 | Pearson |    32
am.0       |      cyl.4 | -0.47 | [-0.71, -0.15] | -2.94 | 30 | 0.062  | Pearson |    32
am.0       |      cyl.6 | -0.02 | [-0.37,  0.33] | -0.13 | 30 | 1.000  | Pearson |    32
am.0       |      cyl.8 |  0.47 | [ 0.15,  0.71] |  2.94 | 30 | 0.062  | Pearson |    32
am.0       |        mpg | -0.60 | [-0.78, -0.32] | -4.11 | 30 | 0.004  | Pearson |    32
am.0       |       disp |  0.59 | [ 0.31,  0.78] |  4.02 | 30 | 0.004  | Pearson |    32
am.1       |      cyl.4 |  0.47 | [ 0.15,  0.71] |  2.94 | 30 | 0.062  | Pearson |    32
am.1       |      cyl.6 |  0.02 | [-0.33,  0.37] |  0.13 | 30 | 1.000  | Pearson |    32
am.1       |      cyl.8 | -0.47 | [-0.71, -0.15] | -2.94 | 30 | 0.062  | Pearson |    32
am.1       |        mpg |  0.60 | [ 0.32,  0.78] |  4.11 | 30 | 0.004  | Pearson |    32
am.1       |       disp | -0.59 | [-0.78, -0.31] | -4.02 | 30 | 0.004  | Pearson |    32
cyl.4      |      cyl.6 | -0.38 | [-0.65, -0.04] | -2.27 | 30 | 0.153  | Pearson |    32
cyl.4      |      cyl.8 | -0.64 | [-0.81, -0.37] | -4.54 | 30 | 0.001  | Pearson |    32
cyl.4      |        mpg |  0.80 | [ 0.63,  0.90] |  7.35 | 30 | < .001 | Pearson |    32
cyl.4      |       disp | -0.75 | [-0.87, -0.54] | -6.12 | 30 | < .001 | Pearson |    32
cyl.6      |      cyl.8 | -0.47 | [-0.70, -0.14] | -2.89 | 30 | 0.062  | Pearson |    32
cyl.6      |        mpg | -0.03 | [-0.38,  0.32] | -0.17 | 30 | 1.000  | Pearson |    32
cyl.6      |       disp | -0.21 | [-0.52,  0.15] | -1.15 | 30 | 1.000  | Pearson |    32
cyl.8      |        mpg | -0.74 | [-0.87, -0.53] | -6.06 | 30 | < .001 | Pearson |    32
cyl.8      |       disp |  0.88 | [ 0.78,  0.94] | 10.40 | 30 | < .001 | Pearson |    32
mpg        |       disp | -0.85 | [-0.92, -0.71] | -8.75 | 30 | < .001 | Pearson |    32

What do you think about it?

Thanks for your thoughts!

mutlusun avatar May 22 '20 08:05 mutlusun

I think the current behavior is good. There are four things that can be done with factors when computing correlations:

  1. Omit them and/or throw an error (the default behavior; this is what stats::cor() does).
  2. Somehow infer a relationship to a continuous variable, such as computing a polyserial or polychoric correlation, or even just converting to integers and pretending it's a metric variable.
  3. Dummy-code the levels and compute correlations with the dummy variables.
  4. Compute a nominal association statistic like Cramér's V.

All of these options are reasonable choices, but 2-4 each involve important assumptions and might be preferred in different situations. That makes (1) a good default.

(2) might be appropriate if the assumptions about the latent variable distribution are plausible; these correlations are available by specifying the corresponding method argument. (3) would be preferred if the variable is included as a dummy-coded predictor in a regression model. Indeed, correlation() is the only R function I know of that makes constructing a correlation matrix with dummy-coded variables straightforward. (4) might be preferred if the correlation table is intended as the sole summary of variable relationships, though I personally think such statistics lose too much nuance in the data.
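
For concreteness, rough sketches of options (2) and (4), using cor_test()'s method argument and cramers_v() from the easystats effectsize package (treat the exact calls as a sketch, not a recommendation):

library(correlation)

df <- mtcars
df$am  <- factor(df$am)
df$cyl <- factor(df$cyl)

# Option (2): latent-variable correlation between two factors
cor_test(df, "cyl", "am", method = "polychoric")

# Option (4): a nominal association statistic (Cramer's V)
effectsize::cramers_v(df$cyl, df$am)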

bwiernik avatar Apr 23 '21 14:04 bwiernik