matrixStats icon indicating copy to clipboard operation
matrixStats copied to clipboard

Inconsistent use of names in rowWeightedMeans()

Open const-ae opened this issue 4 years ago • 2 comments

Hi Henrik,

I noticed that rowWeightedMeans() with w != NULL is the only of the weighted functions, that doesn't return a named vector:

mat <- matrix(1:32, nrow = 8, ncol = 4)
rownames(mat) <- paste0("row_", 1:8)
colnames(mat) <- paste0("col_", 1:4)
rowweights <- runif(4)
colweights <- runif(8)
mat
#>       col_1 col_2 col_3 col_4
#> row_1     1     9    17    25
#> row_2     2    10    18    26
#> row_3     3    11    19    27
#> row_4     4    12    20    28
#> row_5     5    13    21    29
#> row_6     6    14    22    30
#> row_7     7    15    23    31
#> row_8     8    16    24    32

matrixStats::colWeightedMeans(mat, w = NULL)
#> col_1 col_2 col_3 col_4 
#>   4.5  12.5  20.5  28.5
matrixStats::colWeightedMeans(mat, w = colweights)
#>     col_1     col_2     col_3     col_4 
#>  4.495905 12.495905 20.495905 28.495905

matrixStats::rowWeightedMeans(mat, w = NULL)
#> row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8 
#>    13    14    15    16    17    18    19    20
matrixStats::rowWeightedMeans(mat, w = rowweights)
#> [1] 12.38391 13.38391 14.38391 15.38391 16.38391 17.38391 18.38391 19.38391


matrixStats::rowWeightedVars(mat, w = NULL)
#>    row_1    row_2    row_3    row_4    row_5    row_6    row_7    row_8 
#> 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667
matrixStats::rowWeightedVars(mat, w = rowweights)
#>    row_1    row_2    row_3    row_4    row_5    row_6    row_7    row_8 
#> 108.1507 108.1507 108.1507 108.1507 108.1507 108.1507 108.1507 108.1507

Created on 2020-05-05 by the reprex package (v0.3.0)

Is there a specific reason for this behavior?

Best, Constantin

const-ae avatar May 05 '20 09:05 const-ae

Is there a specific reason for this behavior?

No reason other than names have been more or less ignored. To be more specific, names have not been on the radar and intentionally not been considered for any of the functions that are implemented in native code. The functions that return names are those that are implemented in R and use functions that already return names. In hindsight, it would probably have been safer to explicitly drop the names throughout and only then start considering how to support them (as below).

A way forward here is to start adding support for names in a consistent way. Here are some thoughts:

  • The objective of matrixStats is to maximize speed and minimize memory use.

  • The handling of names comes with overhead. Not dealing with names will always be faster.

  • There are use cases where names are not of interest. To maximize performance, it should always be possible to not handle or produce names. This might require the introduction of a new argument , e.g. useNames = FALSE/TRUE.

  • The added penalty for useNames = FALSE would basically only be if (useNames) { ... }. It could also be that there's room for performance gains for some functions by dropping names attributes before calling help functions, e.g. using if (!useNames) { dimnames(x) <- NULL } in colWeightedMeans() might avoid some overhead from the names attributes later in the code.

  • It's not clear what the default should be. Whatever the default will be, some function will change it's current behavior. I think, setting useNames = FALSE would be the least surprising because that's been the implicit (non-documented) design and that should be how most functions work already.

  • Handling of names can probably be done at the R level without too much performance loss, i.e. it's likely to be almost as fast as if we do it in native code.

  • Adding formal support for names will require lots of new package tests.

HenrikBengtsson avatar May 24 '20 20:05 HenrikBengtsson

Hey, thank you for the comprehensive explanation. I agree with your points above and introducing useNames as an additional argument as the clearest solution, but also see why it would be a lot of effort for limited gain.

I just wondered about this specific case, because when I tried to match the naming behavior with sparseMatrixStats, this was the only place where I had to adapt my tests (https://github.com/const-ae/sparseMatrixStats/commit/ab9d7eaf1b18cd3ad995b88bd7734af387f49912#diff-bbdb18ef963a2c609c183eb84920e804R203) :)

const-ae avatar May 26 '20 11:05 const-ae

Just checking in on this old report. With matrixStats 0.63.0-9011, which uses useNames = TRUE by default, all functions return named results:

> mat <- matrix(1:32, nrow = 8, ncol = 4)
rownames(mat) <- paste0("row_", 1:8)
colnames(mat) <- paste0("col_", 1:4)
rowweights <- runif(4)
colweights <- runif(8)
> matrixStats::colWeightedMeans(mat, w = NULL)
col_1 col_2 col_3 col_4 
  4.5  12.5  20.5  28.5 
> matrixStats::colWeightedMeans(mat, w = colweights)
    col_1     col_2     col_3     col_4 
 5.508512 13.508512 21.508512 29.508512 
> matrixStats::rowWeightedMeans(mat, w = NULL)
row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8 
   13    14    15    16    17    18    19    20 
> matrixStats::rowWeightedMeans(mat, w = rowweights)
   row_1    row_2    row_3    row_4    row_5    row_6    row_7    row_8 
12.36754 13.36754 14.36754 15.36754 16.36754 17.36754 18.36754 19.36754 
> matrixStats::rowWeightedVars(mat, w = NULL)
   row_1    row_2    row_3    row_4    row_5    row_6    row_7    row_8 
106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 
> matrixStats::rowWeightedVars(mat, w = rowweights)
   row_1    row_2    row_3    row_4    row_5    row_6    row_7    row_8 
175.0556 175.0556 175.0556 175.0556 175.0556 175.0556 175.0556 175.0556 

HenrikBengtsson avatar Jun 01 '23 16:06 HenrikBengtsson

Great. I have already adapted my unit tests and all tests pass with matrixStats version 0.63.0-9011.

const-ae avatar Jun 01 '23 16:06 const-ae

Excellent

HenrikBengtsson avatar Jun 01 '23 16:06 HenrikBengtsson