rSVD icon indicating copy to clipboard operation
rSVD copied to clipboard

What is the expected amount of error?

Open const-ae opened this issue 6 months ago • 2 comments

Hi,

first of all, thanks for the cool package. I was playing around with randomized SVD to perform dimension reduction on sparse single-cell RNA-seq data, but was surprised that the results differed from irlba's and from the svd function.

To illustrate the issue, I generated a random matrix and compared the results with and without compression:

set.seed(1)
mat <- matrix(rnorm(100 * 500), nrow =100, ncol = 500)
v1 <- rsvd::rsvd(mat, k = 3, p = 30)$v
v2 <- rsvd::rsvd(mat, nu = 3, nv = 3)$v
plot(v1, v2); abline(0,1); abline(0, -1)

Created on 2023-12-13 with reprex v2.0.2

My questions are: Are the differences between the results expected? Is there a good adaptive way to set p to ensure that the results are within a reasonable bound of the true result?

Best, Constantin

const-ae avatar Dec 13 '23 09:12 const-ae

Hello Constantin,

The rsvd package assumes that the input matrix has low-rank structure. In the example above, you are generating a full-rank matrix. I don't expect rsvd to work in this case. Can you try to use the following code instead:

set.seed(1)
r = 10        
mat <- matrix(rnorm(100 * r), nrow =100, ncol = r) %*% matrix(rnorm(r * 500), nrow =r, ncol = 500)
v1 <- rsvd::rsvd(mat, k = 3, p = 30)$v
v2 <- rsvd::rsvd(mat, nu = 3, nv = 3)$v
plot(v1, v2); abline(0,1); abline(0, -1)

Here r denotes the rank of the matrix, and you can play around with it by increasing the rank. If r is large, then you need to increase the oversampling parameter p. But typically p=30 is a good choice in practice.

Best, Ben

erichson avatar Dec 22 '23 06:12 erichson

Hi Ben,

thanks for the explanation. My use case is somewhere in between the full-rank matrix and the low-rank example you gave. I work on high-dimensional (e.g., 10,000 x 50,000 observations) biological data (specifically, single-cell RNA-seq), where it is well established that the data lie on some lower dimensional manifold (typically assumed to be between 25-100 dimensional) and that each observation is corrupted by unstructured noise. However, there is no clear boundary between dimensions with true signal and dimensions containing only noise.

rsvd seems great because it can handle sparse input and is fast, but I find it difficult to judge if the disagreement with the optimal result could be relevant for downstream analysis. Of course, that question is beyond what we will be able to gauge here and will probably require some careful benchmarking. For me, it is already helpful to see that I am not making some basic error in how I call the function or miss something otherwise obvious.

Below is an example using a biologically interesing single-cell dataset. Thank you again for the package and the reply; feel free to close the issue :)


Single-cell example:

# BiocManager::install(c("muscData", "scuttle"))
sce <- muscData::Kang18_8vs8()
#> see ?muscData and browseVignettes('muscData') for documentation
#> loading from cache
mat <- SummarizedExperiment::assay(scuttle::logNormCounts(sce), "logcounts")
# Subsetting for speed
hvg <- order(-rowVars(mat))
mat <- mat[hvg[1:1e4],sample.int(ncol(mat), size = 2e4)]

system.time(
  pca1 <- irlba::irlba(t(as.matrix(mat)), nv = 30)
)
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 1.5 GiB
#>    user  system elapsed 
#>  46.175   1.669  48.359
system.time(
  pca2 <- rsvd::rsvd(t(mat), k = 30, p = 30)
)
#>    user  system elapsed 
#>   5.750   0.153   5.949
plot(pca1$v[,1], pca2$v[,1], main = "The first singular vector agrees well"); abline(0,1); abline(0, -1)

plot(pca1$v[,15], pca2$v[,15], main = "The 15th singular vector agrees less well"); abline(0,1); abline(0, -1)

plot(pca1$v[,30], pca2$v[,30], main = "The 30th singular vector shows little correlation"); abline(0,1); abline(0, -1)


total_var <- sum(mat^2)
plot(cumsum(pca1$d^2 / total_var), type = "l", main = "Fraction of variance explained")
lines(cumsum(pca2$d^2 / total_var), col = "red")

Created on 2024-01-02 with reprex v2.0.2

const-ae avatar Jan 02 '24 10:01 const-ae