matrixStats
matrixStats copied to clipboard
Inconsistent use of names in rowWeightedMeans()
Hi Henrik,
I noticed that rowWeightedMeans()
with w != NULL
is the only of the weighted functions, that doesn't return a named vector:
mat <- matrix(1:32, nrow = 8, ncol = 4)
rownames(mat) <- paste0("row_", 1:8)
colnames(mat) <- paste0("col_", 1:4)
rowweights <- runif(4)
colweights <- runif(8)
mat
#> col_1 col_2 col_3 col_4
#> row_1 1 9 17 25
#> row_2 2 10 18 26
#> row_3 3 11 19 27
#> row_4 4 12 20 28
#> row_5 5 13 21 29
#> row_6 6 14 22 30
#> row_7 7 15 23 31
#> row_8 8 16 24 32
matrixStats::colWeightedMeans(mat, w = NULL)
#> col_1 col_2 col_3 col_4
#> 4.5 12.5 20.5 28.5
matrixStats::colWeightedMeans(mat, w = colweights)
#> col_1 col_2 col_3 col_4
#> 4.495905 12.495905 20.495905 28.495905
matrixStats::rowWeightedMeans(mat, w = NULL)
#> row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8
#> 13 14 15 16 17 18 19 20
matrixStats::rowWeightedMeans(mat, w = rowweights)
#> [1] 12.38391 13.38391 14.38391 15.38391 16.38391 17.38391 18.38391 19.38391
matrixStats::rowWeightedVars(mat, w = NULL)
#> row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8
#> 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667
matrixStats::rowWeightedVars(mat, w = rowweights)
#> row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8
#> 108.1507 108.1507 108.1507 108.1507 108.1507 108.1507 108.1507 108.1507
Created on 2020-05-05 by the reprex package (v0.3.0)
Is there a specific reason for this behavior?
Best, Constantin
Is there a specific reason for this behavior?
No reason other than names have been more or less ignored. To be more specific, names have not been on the radar and intentionally not been considered for any of the functions that are implemented in native code. The functions that return names are those that are implemented in R and use functions that already return names. In hindsight, it would probably have been safer to explicitly drop the names throughout and only then start considering how to support them (as below).
A way forward here is to start adding support for names in a consistent way. Here are some thoughts:
-
The objective of matrixStats is to maximize speed and minimize memory use.
-
The handling of names comes with overhead. Not dealing with names will always be faster.
-
There are use cases where names are not of interest. To maximize performance, it should always be possible to not handle or produce names. This might require the introduction of a new argument , e.g.
useNames = FALSE/TRUE
. -
The added penalty for
useNames = FALSE
would basically only beif (useNames) { ... }
. It could also be that there's room for performance gains for some functions by dropping names attributes before calling help functions, e.g. usingif (!useNames) { dimnames(x) <- NULL }
incolWeightedMeans()
might avoid some overhead from the names attributes later in the code. -
It's not clear what the default should be. Whatever the default will be, some function will change it's current behavior. I think, setting
useNames = FALSE
would be the least surprising because that's been the implicit (non-documented) design and that should be how most functions work already. -
Handling of names can probably be done at the R level without too much performance loss, i.e. it's likely to be almost as fast as if we do it in native code.
-
Adding formal support for names will require lots of new package tests.
Hey, thank you for the comprehensive explanation. I agree with your points above and introducing useNames
as an additional argument as the clearest solution, but also see why it would be a lot of effort for limited gain.
I just wondered about this specific case, because when I tried to match the naming behavior with sparseMatrixStats, this was the only place where I had to adapt my tests (https://github.com/const-ae/sparseMatrixStats/commit/ab9d7eaf1b18cd3ad995b88bd7734af387f49912#diff-bbdb18ef963a2c609c183eb84920e804R203) :)
Just checking in on this old report. With matrixStats 0.63.0-9011, which uses useNames = TRUE
by default, all functions return named results:
> mat <- matrix(1:32, nrow = 8, ncol = 4)
rownames(mat) <- paste0("row_", 1:8)
colnames(mat) <- paste0("col_", 1:4)
rowweights <- runif(4)
colweights <- runif(8)
> matrixStats::colWeightedMeans(mat, w = NULL)
col_1 col_2 col_3 col_4
4.5 12.5 20.5 28.5
> matrixStats::colWeightedMeans(mat, w = colweights)
col_1 col_2 col_3 col_4
5.508512 13.508512 21.508512 29.508512
> matrixStats::rowWeightedMeans(mat, w = NULL)
row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8
13 14 15 16 17 18 19 20
> matrixStats::rowWeightedMeans(mat, w = rowweights)
row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8
12.36754 13.36754 14.36754 15.36754 16.36754 17.36754 18.36754 19.36754
> matrixStats::rowWeightedVars(mat, w = NULL)
row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8
106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667 106.6667
> matrixStats::rowWeightedVars(mat, w = rowweights)
row_1 row_2 row_3 row_4 row_5 row_6 row_7 row_8
175.0556 175.0556 175.0556 175.0556 175.0556 175.0556 175.0556 175.0556
Great. I have already adapted my unit tests and all tests pass with matrixStats
version 0.63.0-9011.
Excellent