corrr Proper usage of 2nd data input, y in correlate()

Proper usage of 2nd data input, y in correlate()

Open thisisdaryn opened this issue 4 years ago • 1 comments

correlate() allows for a 2nd data input, y. This raises the question of whether correlate is ever intended to be used with y differing from x.

It would be good to get some clarity/confirmation on how y was intended to be used.

I do not believe that y was intended to be used to supply data different to x for the 2 reasons that follow:

if y has the same number of columns as x but different names, correlate incorrectly assigns the wrong row names to the correlation data frame output. Specifically, it sets the row names to be the same as the column names (which come from the names of y). The column names should come from y and the rownames should come from x.

> correlate(mtcars[,1:2], mtcars[,3:4])

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

# A tibble: 2 x 3
  rowname   disp     hp
  <chr>    <dbl>  <dbl>
1 disp    NA     -0.776
2 hp       0.902 NA

We can verify that the correlations shown are not correct for mtcars (in addition to the fact that the matrix is not symmetric)

> cor(mtcars$hp, mtcars$disp)
[1] 0.7909486

We can see how the rows should be labeled using stats::cor(), which is what correlate() is using internally

> cor(mtcars[,1:2], mtcars[,3:4])
          disp         hp
mpg -0.8475514 -0.7761684
cyl  0.9020329  0.8324475
>

if x and y have different numbers of columns, correlate fails as it is expecting a square matrix of correlations to be cast to cor_df

> correlate(mtcars[1:3], mtcars[4:5])

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

 Error in as_cordf(x, diagonal = diagonal) : 
  Input object x is not a square. The number of columns must be equal to the number of rows.

Incidentally, stats::cor does allow x and y to have different numbers of variables:

> cor(mtcars[1:3], mtcars[4:5])
             hp       drat
mpg  -0.7761684  0.6811719
cyl   0.8324475 -0.6999381
disp  0.7909486 -0.7102139

Aug 17 '20 20:08 thisisdaryn

This outcome was not anticipated. The original intention for correlate was to wrap cor(), providing the same correlation functionality but then convert the results to a data frame. I had not fully explored the outcomes of using y in various ways. I suppose, given that it's taken until now for someone to report the problem, using y is a rare thing in general. Regarldess, it's definitely a bug!

Options I can think of:

Adjust correlate() to handle y properly. Though, will result in having to be able to return an object that is NOT cor_df.
Remove support for y. It feels like the cost of keeping/fixing for it outweighs the benefits. However, this will be a breaking change.

Aug 26 '20 06:08 drsimonj

corrr corrr copied to clipboard

Proper usage of 2nd data input, y in correlate()

corrr
corrr copied to clipboard