sparseMatrixStats
sparseMatrixStats copied to clipboard
`row/colVars()` and `row/colSds()` methods for xgCMatrix objects silently ignore the supplied `center`
See for example the definition of the rowVars()
method:
library(sparseMatrixStats)
selectMethod("rowVars", "xgCMatrix")
# Method Definition:
#
# function (x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, ...)
# {
# .local <- function (x, rows = NULL, cols = NULL, na.rm = FALSE)
# {
# if (!is.null(rows)) {
# x <- x[rows, , drop = FALSE]
# }
# if (!is.null(cols)) {
# x <- x[, cols, drop = FALSE]
# }
# dgCMatrix_rowVars(x, na_rm = na.rm)
# }
# .local(x, rows, cols, na.rm, ...)
# }
# <bytecode: 0x8146588>
# <environment: namespace:sparseMatrixStats>
#
# Signatures:
# x
# target "xgCMatrix"
# defined "xgCMatrix"
Here is a concrete example using a dgCMatrix object:
library(Matrix)
i <- 1:5
j <- 2:6
m0 <- sparseMatrix(i, j, x=10*i+j, dims=c(5, 7)) # dgCMatrix object
center
not supplied:
matrixStats::rowVars(as.matrix(m0))
# [1] 20.57143 75.57143 165.14286 289.28571 448.00000
rowVars(m0)
# [1] 20.57143 75.57143 165.14286 289.28571 448.00000
center
supplied:
matrixStats::rowVars(as.matrix(m0), center=0)
# [1] 24.00000 88.16667 192.66667 337.50000 522.66667
rowVars(m0, center=0) # 'center' is ignored!
# [1] 20.57143 75.57143 165.14286 289.28571 448.00000
Thanks, H.
sessionInfo():
R Under development (unstable) (2020-10-28 r79382)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS
Matrix products: default
BLAS: /home/hpages/R/R-4.1.r79382/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79382/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.2-18 sparseMatrixStats_1.3.0 MatrixGenerics_1.3.0
[4] matrixStats_0.57.0
loaded via a namespace (and not attached):
[1] compiler_4.1.0 Rcpp_1.0.5 grid_4.1.0 lattice_0.20-41
Hi Hervé,
I understood Henrik in https://github.com/HenrikBengtsson/matrixStats/issues/183 that center must be an honest center estimate of the column means:
This calculation is based on the assumption that center is a proper estimate of the column means.
https://github.com/HenrikBengtsson/matrixStats/issues/183#issuecomment-705169785
Accordingly, rowVars(m0, center=0)
is undefined behavior.
I could of course replicate the behavior of matrixStats
, but as the result is non-sense, I am not sure what we would gain by that.
My thinking for not implementing the center
parameter was that at the moment center
is just a performance optimization when one calculates mean and var with matrixStats
and thus wasn't a priority for sparseMatrixStats
.
Best, Constantin
Hi Constantin,
My use case is to efficiently compute:
f <- function(v) {
v <- v[!is.na(v)]
sum(v^2) / max(1L, length(v) - 1L)
}
apply(m, 1L, f)
I was hoping I could do it with:
rowVars(m, center=0, na.rm=TRUE)
which works fine when m
is an ordinary matrix but not when it's a sparse matrix.
But that's ok, I guess I can use:
rowSums(m^2 / (rowSums(!is.na(m)) - 1L), na.rm=TRUE)
instead.
Doesn't seem that it would be too hard to generalize this to an arbitrary center
e.g. by doing something like this in the rowVars()
method for xgCMatrix objects:
if (is.null(center)) {
dgCMatrix_rowVars(x, na_rm=na.rm)
} else {
rowSums((m-center)^2) / (rowSums(!is.na(m)) - 1L, na.rm=na.rm)
}
Anyways, at the very least the methods in sparseMatrixStats should issue a warning or an error that center
is not supported. Silently ignoring the argument is no good.
Thanks!
P.S.: FWIW my use case comes from here: https://github.com/Bioconductor/DelayedArray/blob/aa9a74556f5c36307ceb638eb31199477139fea8/R/DelayedArray-utils.R#L857
Not that tx
will ever be a dgCMatrix object here but I was playing around a little with rowSds(m, center=0, na.rm=TRUE)
to try to get a sense of how much I can trust this and my conclusion was: not too much. This could easily hit someone who wants to implement a native scale()
method for dgCMatrix objects.
@hpages, I don't think this is a good idea. There are at least two different ways to calculate the variance and it's not safe to make assumptions on which one is used. This is normally not a problem when there's no center
argument. Your use case is one example of why I'm getting more and more concerned about exposing the center
argument. It might become used, or who knows is already used, in ways that give incorrect results, and then if we switch the calculation internally things might break silently.
I agree. Better not having it than having it do something that is not well defined.
I'll just mention that I've been obsessively using center=
with some precomputed row means under the assumption that it was more efficient. But upon inspecting the code, it's not always obvious that's the case, given that the center=
path contains an extra allocation of the same dimensions as x
. Some testing bears that out:
library(matrixStats)
y <- matrix(rnorm(1e8), ncol=1000)
system.time(out <- rowVars(y))
## user system elapsed
## 0.371 0.000 0.371
rm <- rowMeans(y)
system.time(out2 <- rowVars(y, center=rm))
## user system elapsed
## 0.298 0.144 0.441
(As an aside: in this case, using center=
gets better when we increase the number of columns - presumably the cache misses start to add up in the C_rowVars
function, beyond the cost of doing an extra allocation? In my own applications, I use Welford's algorithm to get around this via column-major processing - seems to work well enough.)
Anyway, the tl;dr is that there doesn't seem to be an unambiguous performance advantage to using center=
, so I guess I'd be happy to see it gone. Certainly it would make the lives of the downstream *MatrixStats developers easier.