[r][cpp] Add `sparse = TRUE` as an option for `pseudobulk_matrix()`
Hi @bnprks, I added an option `sparse = TRUE/FALSE` to `pseudobulk_matrix()`.
Motivation
Previously, `pseudobulk_matrix()` only returned dense matrices for the aggregated data. This works well when the number of cell groups is small. However, when `cell_groups` is large (e.g., due to many meta-cells), the resulting pseudobulk matrix can become extremely wide. In such cases, returning a dense matrix can lead to significant memory overhead, even though most entries are zeros. This PR addresses that issue by allowing `pseudobulk_matrix()` to return a sparse matrix, greatly improving memory efficiency for large and sparse groupings.
Changes Made
- [r] Added a new argument `sparse = FALSE` (default) to `pseudobulk_matrix()` in `r/R/singlecell_utils.R`. When it is set to `TRUE`, the resulting matrices become `dgCMatrix`. Existing code depending on the old `pseudobulk_matrix()` should not be affected.
- [cpp] Added `r/src/bpcells-cpp/matrixUtils/PseudobulkSparse.h` and `.cpp` for the implementation. The general ideas and calculation methods were copied from `Pseudobulk.cpp`. Each resulting sparse matrix (`Eigen::SparseMatrix<double>`) is constructed with `.setFromTriplets()`. The internal trick is that `pseudobulk_matrix_sparse()` must receive an ordered `cell_groups` vector, and the input `mat` must also be ordered according to `cell_groups`. The ordering is handled by the outside caller, i.e., the R function `pseudobulk_matrix()`.
- [r] Added corresponding unit tests in `r/tests/testthat/test-singlecell_utils.R`.
Example
# A 26862 x 985167 single-cell matrix
bm1 <- open_matrix_dir("/develop/rstudio/seurat5_learn/data/parse_1m_pbmc/")
bm1
# Simulate 50000 meta-cells with random orders
cell.groups <- sample(1:50000, ncol(bm1), replace = T)
cell.groups <- factor(cell.groups, 1:50000)
# Create an aggregated expression matrix for meta-cells
bench::mark(
agg_mat <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, sparse = T)
)
This operation may take a long time, but it avoids the RAM overhead of constructing the dense matrix.
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 avg2 <- BPCells::pseudobulk_matrix(bm1, c… 6.01m 6.01m 0.00277 4.19GB 0.00277 1 1 6.01m <dgCMatrx[,50000]> <Rprofmem> <bench_tm> <tibble>
Limitations and Request for Feedback
- To maintain compatibility with the original `pseudobulk_matrix()`, which performs a single-pass traversal to compute `non_zeros`, `sum`, `mean`, and `variance` simultaneously, the current sparse implementation follows the same strategy. As a result, even if the user requests only one statistic (e.g., `mean`), all the others are still computed and returned internally, including `non_zeros`. This design ensures a consistent output structure and avoids branching logic, but may introduce unnecessary computational overhead in simple use cases.
- I'm wondering whether it would be worthwhile to refactor the logic so that only the requested statistics are computed when `sparse = TRUE`. This could reduce runtime and memory usage, especially when working with very large datasets and lightweight operations. Any suggestions or better ideas for this? Or should we just keep the original strategy for now?
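As a rough, hedged illustration of the "compute only the requested statistic" idea (hypothetical names in std-only C++, not the package's actual Eigen code): one pass over a group's nonzero values can keep count / sum / sum-of-squares and derive just the statistic named by `method` at the end.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical sketch: a single streaming pass keeps the three running
// accumulators, and a terminal branch derives only the requested statistic.
// A fuller refactor would also skip accumulating sumsq when it is not needed.
double group_stat(const std::vector<double>& nonzero_values, int n_cells,
                  const std::string& method) {
    double sum = 0.0, sumsq = 0.0;
    int nnz = 0;
    for (double v : nonzero_values) {
        sum += v;
        sumsq += v * v;
        ++nnz;
    }
    if (method == "nonzeros") return (double)nnz;
    if (method == "sum") return sum;
    if (method == "mean") return sum / n_cells;  // implicit zeros count toward n_cells
    // "variance": sample variance over all n_cells cells, zeros included
    return (sumsq - sum * sum / n_cells) / (n_cells - 1);
}
```

The per-column branch cost is negligible next to the accumulation itself, which is the argument for computing only what was asked for.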
Hi @ycli1995, thanks for the PR! It might be a week or two before we can fully process this, but just letting you know I've seen the PR and I'll discuss it with @immanuelazn.
A few initial ideas:
- If you anticipate a use-case where memory usage will be a problem, it's probably worth the extra effort to minimize intermediate memory usage. I think this can likely be done with minimal computational cost as a few extra branches for each matrix column should have negligible cost relative to the actual computations.
- It might be interesting to think about whether a two-pass algorithm or unconditionally doing a `storage_order_transpose()` might lead to better running time on cell-major data by avoiding disk seeks. For two-pass, I'm imagining a first pass to calculate which output entries are non-zero, then a second pass that calculates the actual statistics -- I think it should be possible assuming the entries in each column are loaded in order by row, but I haven't fully thought it out. The `storage_order_transpose()` option is probably easier to think through and test.
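The two-pass idea could be prototyped roughly like this (a std-only C++ toy, not BPCells code: the hash map stands in for whatever index structure a real implementation would use, and an in-memory vector stands in for two scans of the on-disk matrix).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Pass 1 only discovers which (gene, group) output entries are nonzero, so
// the CSC arrays can be allocated at exactly their final size; pass 2 streams
// the data again and accumulates sums in place.
struct Entry { int row, cell; double value; };

struct CscMatrix {
    std::vector<int> col_ptr, row_idx;
    std::vector<double> values;
};

CscMatrix pseudobulk_sum_two_pass(const std::vector<Entry>& entries,
                                  const std::vector<int>& cell_group, int n_groups) {
    auto key = [](int row, int group) { return ((int64_t)group << 32) | (uint32_t)row; };
    // Pass 1: count each distinct nonzero (row, group) pair once per group.
    std::unordered_map<int64_t, int> slot;  // (row, group) -> output position
    std::vector<int> nnz_per_group(n_groups, 0);
    for (const Entry& e : entries)
        if (slot.emplace(key(e.row, cell_group[e.cell]), -1).second)
            ++nnz_per_group[cell_group[e.cell]];
    // Allocate the output at exactly its final size -- the memory win.
    CscMatrix out;
    out.col_ptr.assign(n_groups + 1, 0);
    for (int g = 0; g < n_groups; ++g)
        out.col_ptr[g + 1] = out.col_ptr[g] + nnz_per_group[g];
    out.row_idx.assign(out.col_ptr[n_groups], 0);
    out.values.assign(out.col_ptr[n_groups], 0.0);
    // Pass 2: give each pair a slot in its group's column, then accumulate.
    // (A real implementation would also sort row indices within each column.)
    std::vector<int> fill = out.col_ptr;  // next free position per column
    for (const Entry& e : entries) {
        int& pos = slot[key(e.row, cell_group[e.cell])];
        if (pos < 0) {
            pos = fill[cell_group[e.cell]]++;
            out.row_idx[pos] = e.row;
        }
        out.values[pos] += e.value;
    }
    return out;
}
```

The trade-off is reading the input twice (or once after a transpose) in exchange for never over-allocating the output.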
Also, for the use-cases you have in mind, what cell counts and pseudobulk sizes would you anticipate? If we add storage-ordering requirements on the input, it would be possible to make the output an "IterableMatrix" (entries calculated on demand) rather than a "dgCMatrix". Not sure if that would be useful for what you have in mind.
Hi @bnprks, thanks for your thoughtful response.
- Regarding minimizing intermediate memory usage, I'm thinking about controlling which statistics matrices are initialized based on the `method` parameter on the C++ side. Currently, even if the user only requests "variance", the dense matrices for `non_zeros`, `mean`, and `var` are all initialized, which leads to unnecessary memory usage. Here's a quick example:
# A 33538 x 11769 single-cell matrix
bm1 <- BPCells::open_matrix_10x_hdf5("/develop/10x_data/rna/pbmc10k/filtered_feature_bc_matrix.h5") %>%
BPCells::write_matrix_dir(tempfile("bm1"))
# Simulate the cell groups
cell.groups <- sample(1:2000, ncol(bm1), replace = T)
cell.groups <- factor(cell.groups, 1:2000)
bench::mark(
check = F, min_time = 10,
out1 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "sum"),
out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "nonzeros"),
out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "mean"),
out3 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "variance")
)
The memory usage for `method = "variance"` is actually about 3 times that of `method = "sum"`:
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 "out1 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 3.66s 3.67s 0.273 2.69MB 0 3 0 11s <NULL> <Rprofmem> <bench_tm> <tibble>
2 "out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 3.65s 3.65s 0.274 2.69MB 0 3 0 11s <NULL> <Rprofmem> <bench_tm> <tibble>
3 "out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 5.76s 5.76s 0.174 5.25MB 0 2 0 11.5s <NULL> <Rprofmem> <bench_tm> <tibble>
4 "out3 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 7.82s 7.83s 0.128 7.81MB 0 2 0 15.7s <NULL> <Rprofmem> <bench_tm> <tibble>
- In my sparse implementation, I tried to reuse buffer-like vectors to handle each group sequentially. This relies on `cell_groups` being pre-sorted and the input matrix being ordered accordingly. Even in the transposed case, the buffer handles per-gene data across all groups. Therefore, if `IterableMatrix` can efficiently traverse ordered columns, we could initialize only the required output matrix, saving considerable memory. Do you think that's worth pursuing?
- As for the suggestion of returning an `IterableMatrix` instead of a `dgCMatrix`: that's a very interesting direction. For my current use case, I'm mainly focused on getting the sparse pseudobulk matrix into R in a format that works smoothly with downstream packages like `monocle` and `SCENIC`, or other Matrix-based workflows, so `dgCMatrix` is a good fit for now. However, I'd love to explore that direction further if time and resources allow.
So sorry for the delay on this; I'll be taking over reviewing it. I took the liberty of trying out your implementation on a smaller subset so I can properly A/B test on my consumer hardware.
bm1 <- open_matrix_market("DGE_1M_PBMC.mtx") %>% write_matrix_dir("dge_1m_pbmc")
bm2 <- bm1[1:300000,] %>% t() %>% write_matrix_dir("dge_100k_pbmc", overwrite = TRUE)
cell.groups <- sample(1:10000, ncol(bm2), replace = T)
cell.groups <- factor(cell.groups, 1:10000)
bench::mark(
agg_mat <- BPCells::pseudobulk_matrix(bm2, cell_groups = cell.groups, sparse = T), iterations = 10
)
With the following results on sparse output:
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 agg_mat <- B… 6.4s 8.19s 0.131 1.06GB 0.0146 9 1 1.14m
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
And for the dense output:
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 agg_mat <- B… 7.22s 10.5s 0.0992 4.52GB 0.0661 6 4 1.01m
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
This essentially shows that the difference in computation time is pretty minimal, which is great! I'd actually be happy to set sparse as the default.
The main consideration I currently have is how fast a matrix-matrix multiplication approach to pseudobulking can go. We should be able to beat that baseline if we're creating specialized code for this function; otherwise, we might be unnecessarily adding complicated tech debt to the repo. A matrix-matrix multiplication can produce every pseudobulk statistic except the variance, so creating these and A/B testing against `pseudobulk_matrix()` for time, after converting to a `dgCMatrix`, would be good. If you have time, it would be great to have you take the helm on this. Otherwise, I can do it around next week.
As for the implementation itself, I felt I had to refer to the Eigen docs more than I would want to while reading it. You copied my comments, which helped align each step with the previous dense implementation, but we need additional explanation for why there are three new structs to support this. Overall, more explanation of your implementation as a whole would help; pretend the reader is not familiar with Eigen's sparse methods. I'll leave some comments today and tomorrow on spots that I think could be better documented.
Also, you can probably get closer to true speeds if you turn on C++ optimizations during devtools builds. I don't think your benchmarks portray the actual times we'd expect a user to spend on these functions, but I could be wrong.
library(pkgbuild)
# Swap in release (optimized) compiler flags, then rebuild the package
flags <- pkgbuild::compiler_flags(debug = FALSE)
new_compiler_flags <- function(debug = FALSE) {flags}
assignInNamespace('compiler_flags', new_compiler_flags, 'pkgbuild')
devtools::load_all("r", recompile = TRUE)
Hi @immanuelazn,
Sorry for the delayed reply.
Thanks a lot for reviewing this PR. Unfortunately, given my current workload at the company I work for, I won't be able to follow up on this development on GitHub in the near term. If this PR needs to be closed for now to avoid leaving things hanging, I completely understand.
I really appreciate your review. Maybe in the future, when I have more spare time, I can revisit and contribute to BPCells again. Thanks for your patience and understanding!
Best regards, Yuchen
Hi @ycli1995, no problem. I really appreciate the work you put into this. Do you mind if I take over development, then?
Hi @immanuelazn ,
That would be perfect! I’d be more than happy for you to take over the development. Really glad to see BPCells moving forward, and thanks for your willingness to continue working on it.
Best, Yuchen