
Crossprod and/or covariance matrix of IterableMatrix

kollo97 opened this issue on Feb 05, 2025 · 6 comments

Hi,

First of all, thank you for this great package! For my project I'm looking into gene-gene correlations from scRNA-seq data, and I'd be super interested in doing all the analysis using BPCells. For that purpose, I was wondering: what would be the best way to calculate the covariance matrix of an IterableMatrix?

Best, Aaron

kollo97 avatar Feb 05 '25 10:02 kollo97

Hi Aaron, thanks for your question! There's not a built-in function to do this right now in BPCells, but it's possible to calculate one by combining the existing supported operations. I've provided a code example below to calculate a covariance matrix -- option 1 calculates the full covariance matrix directly into a base R matrix object, and option 2 uses some internal BPCells functions to make an IterableMatrix object that holds the covariance matrix. Option 2 is useful if you only want to extract a subset of the matrix entries at a time, as it will perform only the minimum calculations required to produce the requested subset.

Either way, the result might be somewhat large -- the final output should require n_genes * n_genes * 8 bytes of memory (e.g. 3.2GB for 20k genes x 20k genes), and option 1 might require a few times that memory.
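
For reference, that estimate is just 8 bytes per double-precision entry:

n_genes <- 20000
n_genes^2 * 8 / 1e9  # 3.2 (GB) for the dense n_genes x n_genes result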

I'm not quite sure how long this will take for a large dataset, as I haven't run it before. I'd recommend testing first with 1% and 10% of your cells to get a ballpark sense of timing before you run your full-scale analysis (see the sketch below). And if you can report back how long it took and how many cells and genes you used, that would be useful information to have for the future (in case we want to add this as built-in BPCells functionality).
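
If it's useful, here's a rough sketch of that subsampling loop (the fractions are just placeholders, and it reuses the bp_x defined in the code example below):

for (frac in c(0.01, 0.1)) {
    cell_subset <- sample(ncol(bp_x), size = round(ncol(bp_x) * frac))
    bp_sub <- bp_x[, cell_subset]
    bp_sub_t <- transpose_storage_order(t(bp_sub))
    print(system.time(as.matrix(bp_sub %*% bp_sub_t)))
}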

EDIT: Another tip for making this run fast: if you can, make sure the files are stored on SSD storage. On laptops you shouldn't have to worry, but on clusters you might want to try copying your input files to the fastest local storage option available. This is because the sparse-sparse matrix multiply performs some random (non-sequential) reads, which can be slow on network-based file systems (e.g. some benchmarks here)
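
A minimal sketch of the copy-to-local-storage idea (both paths are placeholders, and it assumes the matrix was saved as a directory with write_matrix_dir()):

# Copy a directory-backed matrix from network storage to local scratch space
local_copy <- file.path(Sys.getenv("TMPDIR", "/tmp"), "my_matrix")
file.copy("/network/storage/my_matrix", dirname(local_copy), recursive = TRUE)
bp_x <- open_matrix_dir(local_copy)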

If you have plenty of memory available, or all you want is, say, the covariance of 20k genes x 50 genes, then there might be faster alternative options.
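
For instance, here's a hedged sketch of one such alternative for the small-subset case (gene_idx is a placeholder for your 50 genes of interest; it assumes multiplying an IterableMatrix by a dense R matrix, which I believe BPCells supports, and reuses the bp_x defined below):

gene_idx <- 1:50                              # placeholder for the genes of interest
n <- ncol(bp_x)
sub_dense <- t(as.matrix(bp_x[gene_idx, ]))   # n_cells x 50 dense block
cross <- as.matrix(bp_x %*% sub_dense)        # n_genes x 50
mu <- rowMeans(bp_x)
cov_sub <- cross / (n - 1) - tcrossprod(mu, mu[gene_idx]) * n / (n - 1)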

Anyhow, here's the code example:

# Set up a mini test example - assume rows are genes
library(BPCells)
library(Matrix) # provides the dgCMatrix sparse matrix class
set.seed(125124)
x <- matrix(runif(12), nrow=4)
# Convert to an IterableMatrix
bp_x <- x |> as("dgCMatrix") |> as("IterableMatrix")

# This is required so that the storage order matches between bp_x and bp_xt
# You might want to use the `outdir` argument of `transpose_storage_order()` to save
# this matrix where it can later be loaded with `open_matrix_dir()`
bp_xt <- transpose_storage_order(t(bp_x))

# Option 1: do most of the math with base R matrices
ans1 <- (
    as.matrix(bp_x %*% bp_xt) / (ncol(bp_x)-1) - 
    tcrossprod(rowMeans(bp_x)) * ncol(bp_x) / (ncol(bp_x) - 1)
)

all.equal(ans1, cov(t(x)))

# Option 2: Do everything fully disk-backed until getting out results (uses some internal functions)
cov_mat <- bp_x %*% (bp_xt / (ncol(bp_x) - 1))
row_means <- rowMeans(bp_x)

# The TransformLinearResidual class is used in `BPCells::regress_out()`,
# and calculates x - t(row_params) %*% col_params
ans2 <- BPCells:::wrapMatrix(
    "TransformLinearResidual", 
    cov_mat, 
    row_params=t(row_means), 
    col_params=t(row_means) * ncol(bp_x)/(ncol(bp_x)-1)
)

all.equal(as.matrix(ans2), cov(t(x)))
# The nice thing about ans2 is you can subset prior to running `as.matrix` 
# to only calculate the requested matrix entries
all.equal(as.matrix(ans2[2:3,1:2]), cov(t(x))[2:3,1:2])
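
And to illustrate the `outdir` note from the comments above, a minimal sketch of persisting the transposed copy so it can be reused across sessions (the path is a placeholder):

bp_xt <- transpose_storage_order(t(bp_x), outdir = "cov_example/x_transpose")
# In a later R session, reopen it without recomputing:
bp_xt <- open_matrix_dir("cov_example/x_transpose")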

bnprks avatar Feb 05 '25 20:02 bnprks

Hi Ben, thank you first of all for this thorough and helpful answer, and sorry for the really late reply; it totally slipped under my radar! Being able to extract only a subset of the matrix entries at a time using option 2 is indeed convenient. Currently, though, I'd like to compute the entire covariance matrix. I tested the code on a 10k x 10k matrix and also compared it to my current implementation using the bigstatsr package (https://github.com/privefl/bigstatsr), which also implements a version of a file-backed matrix and some matrix calculations:

library(bigstatsr)
library(BPCells)
library(bench)
# Set up a mini test example - assume rows are genes
set.seed(125124)

x <- matrix(runif(10000 * 10000), nrow = 10000)
# x <- matrix(runif(12), nrow = 4)
# Convert to an IterableMatrix
bp_x <- x |>
    as("dgCMatrix") |>
    as("IterableMatrix")
# This is required so that the storage order matches between bp_x and bp_xt
# You might want to use the `outdir` argument of `transpose_storage_order()` to save
# this matrix where it can later be loaded with `open_matrix_dir()`
bp_xt <- transpose_storage_order(t(bp_x))

cov_mat_bpcells <- function(bp_x, bp_xt) {
    cov_mat <- bp_x %*% (bp_xt / (ncol(bp_x) - 1))
    row_means <- rowMeans(bp_x)
    # The TransformLinearResidual class is used in `BPCells::regress_out()`,
    # and calculates x - t(row_params) %*% col_params
    bp_cov_mat <- BPCells:::wrapMatrix(
        "TransformLinearResidual",
        cov_mat,
        row_params = t(row_means),
        col_params = t(row_means) * ncol(bp_x) / (ncol(bp_x) - 1)
    )
    return(as.matrix(bp_cov_mat))
}

benchm <- bench::mark(
    bigstatsr_ans <- big_crossprodSelf(
        as_FBM(t(x)),
        fun.scaling = big_scale(center = TRUE, scale = FALSE)
    )[] / (ncol(x) - 1),
    ans1 <- (
        as.matrix(bp_x %*% bp_xt) / (ncol(bp_x) - 1) -
            tcrossprod(rowMeans(bp_x)) * ncol(bp_x) / (ncol(bp_x) - 1)
    ),
    ans2 <- cov_mat_bpcells(bp_x, bp_xt),
    base_ans <- cov(t(x)),
    check = FALSE
)
benchm
all.equal(bigstatsr_ans[1:5, 1:5], base_ans[1:5, 1:5])
# TRUE
all.equal(ans1[1:5, 1:5], base_ans[1:5, 1:5])
# TRUE
all.equal(ans2[1:5, 1:5], base_ans[1:5, 1:5])
# TRUE

benchm:

expression       min      median   itr/sec    mem_alloc  gc/sec    n_itr  n_gc  total_time
bigstatsr_ans    16.8s    16.8s    0.0595     NA         0.238     1      4     16.8s
ans1             2.43h    2.43h    0.000114   NA         0.000458  1      4     2.43h
ans2             2.47h    2.47h    0.000113   NA         0.000338  1      3     2.47h
base_ans         16.06m   16.06m   0.00104    NA         0.00104   1      1     16.06m

Thanks again and I hope this helps!

Best regards, Aaron

kollo97 avatar Apr 07 '25 08:04 kollo97

Hi @kollo97, thanks for following up with some more detail and benchmarking. The solution I suggested to you obviously performs quite poorly on your mini benchmark, and it looks like it might be worth providing a dedicated operator to do this in BPCells -- I've started coding one up already and will touch base when it is ready.

Separately, I think your example benchmark is maybe not the most realistic for a single-cell use case. In an scRNA-seq counts matrix, about 95% of the entries are zero, and accounting for the sparsity in a matrix-matrix multiply like this can reduce the number of required operations by 400x at 95% sparsity. BPCells definitely benefits from sparsity, but many other libraries won't -- I'm not familiar enough with bigstatsr to know whether it would benefit from sparsity.
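
To make that 400x concrete: each multiply-add in a sparse-sparse product needs both factors to be nonzero, so the expected work scales with the square of the nonzero fraction:

density <- 0.05    # fraction of nonzero entries at 95% sparsity
1 / density^2      # 400x fewer multiply-adds than the dense computation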

It's also quite tricky to assess how much of the benchmark is actually taking place in a disk-backed manner, since a matrix of this size will easily fit within the operating system's file cache, meaning that even file reads will be served from memory rather than from disk. That's maybe a harder problem to solve, as a clean benchmark would require a test dataset too large to fit in the memory of the benchmarking computer.

With that in mind, I'd suggest the following as a benchmark with more realistic sparsity structure. I've also provided a somewhat more optimized base R option, which seems to run much faster than cov() when you have a reasonable BLAS library installed (note the multi-threading, which makes elapsed time much shorter than total CPU time)

library(BPCells)
set.seed(125124)
d <- 10000
# Make a matrix with 95% zero entries
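# (for a Poisson draw, P(X = 0) = exp(-lambda), so lambda = -log(0.95) gives 95% zeros on average)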
x <- matrix(rpois(d * d, lambda=-log(0.95)), nrow = d)
bp_x <- x |> as("dgCMatrix") |> as("IterableMatrix")
bp_xt <- transpose_storage_order(t(bp_x))

system.time({
    ans1 <- as.matrix(bp_x %*% bp_xt) / (ncol(bp_x) - 1) -
            tcrossprod(rowMeans(bp_x)) * ncol(bp_x) / (ncol(bp_x) - 1)
})
#   user  system elapsed 
# 16.212   4.903  18.583 

system.time({
  x_center <- x - rowMeans(x)
  ans2 <- tcrossprod(x_center) / (d-1)
})
#   user  system elapsed 
# 69.076  11.720   6.195 

all.equal(ans1, ans2)
#[1] TRUE

With a realistic sparsity level, BPCells does this calculation in 18.5 seconds. In a truly disk-backed situation I don't expect BPCells would perform quite this well, which is why it may still be worth adding a dedicated operator.

bnprks avatar Apr 10 '25 06:04 bnprks

Hi @bnprks,

Thanks again for a great answer! I was not aware that BPCells benefits this strongly from the sparsity of scRNA-seq counts; it makes sense and is pretty awesome. I also share your concerns regarding the quality/cleanliness of the benchmark - I guess for now it's alright not to explore this too thoroughly.

Looking forward to your dedicated operator once you've got it coded up!

Best, Aaron

kollo97 avatar May 13 '25 09:05 kollo97

Hi @kollo97, we've been a little slow getting this reviewed and merged, but the code is available in PR #243. I believe if you want to install a test version, it should be possible with:

remotes::install_github("bnprks/BPCells/r", ref="bp/crossproduct")

It provides the functions cor_dense(), cov_dense(), crossprod_dense(), and tcrossprod_dense(), which match the interfaces of the corresponding base R functions but take BPCells matrices as inputs.
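
For example (a sketch based on the PR description; exact argument names may differ):

cov_genes <- cov_dense(t(bp_x))    # analog of cov(): variables are columns
cor_genes <- cor_dense(t(bp_x))    # analog of cor()
xtx <- crossprod_dense(bp_x)       # analog of crossprod(): t(x) %*% x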

To give a quick sense of the performance, I re-ran the demo sparse matrix benchmark from above on a new computer, getting:

  • Previous BPCells option: 10.3 seconds elapsed, single-threaded
  • Previous R tcrossprod: 1.7 seconds elapsed, 24 threads
  • New option system.time({ans3 <- cov_dense(t(bp_x))}): 2.8 seconds elapsed, single-threaded
  • New option system.time({ans3 <- cov_dense(t(bp_x), threads=4)}): 1.8 seconds elapsed, 4 threads

Very significantly, the new version requires only a single sequential read through the input matrix, so its performance should stay reliably good even on a large disk-backed dataset. With 1.3M cells, I measured under 1 minute single-threaded, and under 15 seconds with 8 threads, to calculate a covariance matrix over 2k genes.

bnprks avatar May 14 '25 06:05 bnprks

Hi @bnprks, awesome, thanks a lot for implementing this so quickly! I'll be sure to try it and get back to you.

Best, Aaron

kollo97 avatar May 15 '25 08:05 kollo97