corrr
corrr copied to clipboard
Proper usage of 2nd data input, y in correlate()
correlate()
allows for a 2nd data input, y. This raises the question of whether correlate is ever intended to be used with y differing from x.
It would be good to get some clarity/confirmation on how y was intended to be used.
I do not believe that y was intended to be used to supply data different to x for the 2 reasons that follow:
- if y has the same number of columns as x but different names, correlate incorrectly assigns the wrong row names to the correlation data frame output. Specifically, it sets the row names to be the same as the column names (which come from the names of y). The column names should come from y and the rownames should come from x.
> correlate(mtcars[,1:2], mtcars[,3:4])
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 x 3
rowname disp hp
<chr> <dbl> <dbl>
1 disp NA -0.776
2 hp 0.902 NA
We can verify that the correlations shown are not correct for mtcars
(in addition to the fact that the matrix is not symmetric)
> cor(mtcars$hp, mtcars$disp)
[1] 0.7909486
We can see how the rows should be labeled using stats::cor()
, which is what correlate()
is using internally
> cor(mtcars[,1:2], mtcars[,3:4])
disp hp
mpg -0.8475514 -0.7761684
cyl 0.9020329 0.8324475
>
- if x and y have different numbers of columns, correlate fails as it is expecting a square matrix of correlations to be cast to
cor_df
> correlate(mtcars[1:3], mtcars[4:5])
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
Error in as_cordf(x, diagonal = diagonal) :
Input object x is not a square. The number of columns must be equal to the number of rows.
Incidentally, stats::cor
does allow x and y to have different numbers of variables:
> cor(mtcars[1:3], mtcars[4:5])
hp drat
mpg -0.7761684 0.6811719
cyl 0.8324475 -0.6999381
disp 0.7909486 -0.7102139
This outcome was not anticipated. The original intention for correlate was to wrap cor()
, providing the same correlation functionality but then convert the results to a data frame. I had not fully explored the outcomes of using y
in various ways. I suppose, given that it's taken until now for someone to report the problem, using y
is a rare thing in general. Regarldess, it's definitely a bug!
Options I can think of:
- Adjust
correlate()
to handley
properly. Though, will result in having to be able to return an object that is NOTcor_df
. - Remove support for
y
. It feels like the cost of keeping/fixing for it outweighs the benefits. However, this will be a breaking change.