[r][cpp] Add `sparse = TRUE` as an option for `pseudobulk_matrix()`
Hi @bnprks, I added an option `sparse = TRUE/FALSE` to `pseudobulk_matrix()`.
Motivation
Previously, `pseudobulk_matrix()` only returned dense matrices for the aggregated data. This works well when the number of cell groups is small. However, when `cell_groups` is large (e.g., due to many meta-cells), the resulting pseudobulk matrix can become extremely wide. In such cases, returning a dense matrix can lead to significant memory overhead, even though most entries are zeros. This PR addresses that issue by allowing `pseudobulk_matrix()` to return a sparse matrix, greatly improving memory efficiency for large and sparse groupings.
Changes Made
- [r] Added a new argument `sparse = FALSE` (default) to `pseudobulk_matrix()` in `r/R/singlecell_utils.R`. When it is set to `TRUE`, the resulting matrices become `dgCMatrix`. Existing code depending on the old `pseudobulk_matrix()` should not be affected.
- [cpp] Added `r/src/bpcells-cpp/matrixUtils/PseudobulkSparse.h` and `.cpp` for the implementation. The general ideas and calculation methods were copied from `Pseudobulk.cpp`. Each resulting sparse matrix (`Eigen::SparseMatrix<double>`) is constructed with `.setFromTriplets()`. The internal trick is that `pseudobulk_matrix_sparse()` must receive an ordered `cell_groups` vector, and the input `mat` must also be ordered according to `cell_groups`. The ordering is handled by the outside caller, i.e., the R function `pseudobulk_matrix()`.
- [r] Added corresponding unit tests in `r/tests/testthat/test-singlecell_utils.R`.
Example
# A 26862 x 985167 single-cell matrix
bm1 <- open_matrix_dir("/develop/rstudio/seurat5_learn/data/parse_1m_pbmc/")
bm1
# Simulate 50000 meta-cells with random orders
cell.groups <- sample(1:50000, ncol(bm1), replace = T)
cell.groups <- factor(cell.groups, 1:50000)
# Create an aggregated expression matrix for meta-cells
bench::mark(
agg_mat <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, sparse = T)
)
This operation may take a long time, but it avoids the RAM overhead of constructing the dense matrix.
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 avg2 <- BPCells::pseudobulk_matrix(bm1, c… 6.01m 6.01m 0.00277 4.19GB 0.00277 1 1 6.01m <dgCMatrx[,50000]> <Rprofmem> <bench_tm> <tibble>
Limitations and Request for Feedback
- To maintain compatibility with the original `pseudobulk_matrix()`, which performs a single-pass traversal to compute `non_zeros`, `sum`, `mean`, and `variance` simultaneously, the current sparse implementation follows the same strategy. As a result, even if the user requests only one statistic (e.g., `mean`), all the others are still computed and returned internally, including `non_zeros`. This design ensures a consistent output structure and avoids branching logic, but may introduce unnecessary computational overhead in simple use cases.
- I'm wondering whether it would be worthwhile to refactor the logic so that only the requested statistics are computed when `sparse = TRUE`. This could reduce runtime and memory usage, especially when working with very large datasets and lightweight operations. Any suggestions or better ideas for this? Or should we just keep the original strategy for now?
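As a rough, hedged illustration of the "compute only the requested statistic" idea (hypothetical names in std-only C++, not the package's actual Eigen code): one pass over a group's nonzero values can keep count / sum / sum-of-squares and derive just the statistic named by `method` at the end.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical sketch: a single streaming pass keeps the three running
// accumulators, and a terminal branch derives only the requested statistic.
// A fuller refactor would also skip accumulating sumsq when it is not needed.
double group_stat(const std::vector<double>& nonzero_values, int n_cells,
                  const std::string& method) {
    double sum = 0.0, sumsq = 0.0;
    int nnz = 0;
    for (double v : nonzero_values) {
        sum += v;
        sumsq += v * v;
        ++nnz;
    }
    if (method == "nonzeros") return (double)nnz;
    if (method == "sum") return sum;
    if (method == "mean") return sum / n_cells;  // implicit zeros count toward n_cells
    // "variance": sample variance over all n_cells cells, zeros included
    return (sumsq - sum * sum / n_cells) / (n_cells - 1);
}
```

The per-column branch cost is negligible next to the accumulation itself, which is the argument for computing only what was asked for.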
Hi @ycli1995, thanks for the PR! It might be a week or two before we can fully process this, but just letting you know I've seen the PR and I'll discuss it with @immanuelazn.
A few initial ideas:
- If you anticipate a use-case where memory usage will be a problem, it's probably worth the extra effort to minimize intermediate memory usage. I think this can likely be done with minimal computational cost as a few extra branches for each matrix column should have negligible cost relative to the actual computations.
- It might be interesting to think about whether a two-pass algorithm or unconditionally doing a `storage_order_transpose()` might lead to better running time on cell-major data by avoiding disk seeks. For two-pass, I'm imagining a first pass to calculate which output entries are non-zero, then a second pass that calculates the actual statistics -- I think it should be possible assuming the entries in each column are loaded in order by row, but I haven't fully thought it out. The `storage_order_transpose()` option is probably easier to think through and test.
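The two-pass idea could be prototyped roughly like this (a std-only C++ toy, not BPCells code: the hash map stands in for whatever index structure a real implementation would use, and an in-memory vector stands in for two scans of the on-disk matrix).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Pass 1 only discovers which (gene, group) output entries are nonzero, so
// the CSC arrays can be allocated at exactly their final size; pass 2 streams
// the data again and accumulates sums in place.
struct Entry { int row, cell; double value; };

struct CscMatrix {
    std::vector<int> col_ptr, row_idx;
    std::vector<double> values;
};

CscMatrix pseudobulk_sum_two_pass(const std::vector<Entry>& entries,
                                  const std::vector<int>& cell_group, int n_groups) {
    auto key = [](int row, int group) { return ((int64_t)group << 32) | (uint32_t)row; };
    // Pass 1: count each distinct nonzero (row, group) pair once per group.
    std::unordered_map<int64_t, int> slot;  // (row, group) -> output position
    std::vector<int> nnz_per_group(n_groups, 0);
    for (const Entry& e : entries)
        if (slot.emplace(key(e.row, cell_group[e.cell]), -1).second)
            ++nnz_per_group[cell_group[e.cell]];
    // Allocate the output at exactly its final size -- the memory win.
    CscMatrix out;
    out.col_ptr.assign(n_groups + 1, 0);
    for (int g = 0; g < n_groups; ++g)
        out.col_ptr[g + 1] = out.col_ptr[g] + nnz_per_group[g];
    out.row_idx.assign(out.col_ptr[n_groups], 0);
    out.values.assign(out.col_ptr[n_groups], 0.0);
    // Pass 2: give each pair a slot in its group's column, then accumulate.
    // (A real implementation would also sort row indices within each column.)
    std::vector<int> fill = out.col_ptr;  // next free position per column
    for (const Entry& e : entries) {
        int& pos = slot[key(e.row, cell_group[e.cell])];
        if (pos < 0) {
            pos = fill[cell_group[e.cell]]++;
            out.row_idx[pos] = e.row;
        }
        out.values[pos] += e.value;
    }
    return out;
}
```

The trade-off is reading the input twice (or once after a transpose) in exchange for never over-allocating the output.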
Also, for the use-cases you have in mind, what cell counts and pseudobulk sizes would you anticipate? If we add storage-ordering requirements on the input, it would be possible to make the output an "IterableMatrix" (entries calculated on demand) rather than a "dgCMatrix". Not sure if that would be useful for what you have in mind.
Hi @bnprks, thanks for your thoughtful response.
- Regarding minimizing intermediate memory usage, I'm thinking about controlling which statistics matrices are initialized based on the `method` parameter on the C++ side. Currently, even if the user only requests "variance", the dense matrices for `non_zeros`, `mean`, and `var` are all initialized, which leads to unnecessary memory usage. Here's a quick example:
# A 33538 x 11769 single-cell matrix
bm1 <- BPCells::open_matrix_10x_hdf5("/develop/10x_data/rna/pbmc10k/filtered_feature_bc_matrix.h5") %>%
BPCells::write_matrix_dir(tempfile("bm1"))
# Simulate the cell groups
cell.groups <- sample(1:2000, ncol(bm1), replace = T)
cell.groups <- factor(cell.groups, 1:2000)
bench::mark(
check = F, min_time = 10,
out1 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "sum"),
out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "nonzeros"),
out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "mean"),
out3 <- BPCells::pseudobulk_matrix(bm1, cell_groups = cell.groups, method = "variance")
)
The memory usage for `method = "variance"` is actually about 3 times that of `method = "sum"`:
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 "out1 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 3.66s 3.67s 0.273 2.69MB 0 3 0 11s <NULL> <Rprofmem> <bench_tm> <tibble>
2 "out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 3.65s 3.65s 0.274 2.69MB 0 3 0 11s <NULL> <Rprofmem> <bench_tm> <tibble>
3 "out2 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 5.76s 5.76s 0.174 5.25MB 0 2 0 11.5s <NULL> <Rprofmem> <bench_tm> <tibble>
4 "out3 <- BPCells::pseudobulk_matrix(bm1, cell_groups … 7.82s 7.83s 0.128 7.81MB 0 2 0 15.7s <NULL> <Rprofmem> <bench_tm> <tibble>
- In my sparse implementation, I tried to reuse buffer-like vectors to handle each group sequentially. This relies on `cell_groups` being pre-sorted and the input matrix being ordered accordingly. Even in the transposed case, the buffer handles per-gene data across all groups. Therefore, if `IterableMatrix` can efficiently traverse ordered columns, we could initialize only the required output matrix, saving considerable memory. Do you think that's worth pursuing?
- As for the suggestion of returning an `IterableMatrix` instead of a `dgCMatrix`: that's a very interesting direction. For my current use case, I'm mainly focused on getting the sparse pseudobulk matrix into R in a format that works smoothly with downstream packages like `monocle` and `SCENIC`, or other Matrix-based workflows, so `dgCMatrix` is a good fit for now. However, I'd love to explore that direction further if time and resources allow.
So sorry for the delay on this; I'll be taking over reviewing it. I took the liberty of trying out your implementation on a smaller subset so I can properly A/B test on my consumer hardware.
bm1 <- open_matrix_market("DGE_1M_PBMC.mtx") %>% write_matrix_dir("dge_1m_pbmc")
bm2 <- bm1[1:300000,] %>% t() %>% write_matrix_dir("dge_100k_pbmc", overwrite = TRUE)
cell.groups <- sample(1:10000, ncol(bm2), replace = T)
cell.groups <- factor(cell.groups, 1:10000)
bench::mark(
agg_mat <- BPCells::pseudobulk_matrix(bm2, cell_groups = cell.groups, sparse = T), iterations = 10
)
With the following results on sparse output:
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 agg_mat <- B… 6.4s 8.19s 0.131 1.06GB 0.0146 9 1 1.14m
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
And for the dense output:
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 agg_mat <- B… 7.22s 10.5s 0.0992 4.52GB 0.0661 6 4 1.01m
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
This essentially shows that the difference in computation time is pretty minimal, which is great! I'd actually be happy to set sparse as the default.
The main consideration I currently have is how fast a matrix-matrix multiplication approach to pseudobulking can go. We should be able to beat that baseline if we're creating specialized code for this function; otherwise, we might be unnecessarily adding complicated tech debt to the repo. A matrix-matrix multiplication can produce every pseudobulk statistic except the variance, so creating these and A/B testing against `pseudobulk_matrix()` for time, after converting to a `dgCMatrix`, would be good. If you have time, it would be great to have you take the helm on this. Otherwise, I can do it around next week.
As for the implementation itself, I felt I had to refer to the Eigen docs more than I would want to while reading it. You copied my comments, which helped align each step with the previous dense implementation, but we need additional explanation for why there are three new structs to support this. Overall, more explanation of your implementation as a whole would help; pretend the reader is not familiar with Eigen's sparse methods. I'll leave some comments today and tomorrow on spots that I think could be better documented.
Also, you can probably get closer to true speeds if you turn on C++ optimizations during devtools builds. I don't think your benchmarks portray the actual times we'd expect a user to spend on these functions, but I could be wrong.
library(pkgbuild)
# Swap in release (optimized) compiler flags, then rebuild the package
flags <- pkgbuild::compiler_flags(debug = FALSE)
new_compiler_flags <- function(debug = FALSE) {flags}
assignInNamespace('compiler_flags', new_compiler_flags, 'pkgbuild')
devtools::load_all("r", recompile = TRUE)
Hi @immanuelazn,
Sorry for the delayed reply.
Thanks a lot for reviewing this PR. Unfortunately, given my current workload at the company I work for, I won't be able to follow up on this development on GitHub in the near term. If this PR needs to be closed for now to avoid leaving things hanging, I completely understand.
I really appreciate your review. Maybe in the future, when I have more spare time, I can revisit and contribute to BPCells again. Thanks for your patience and understanding!
Best regards, Yuchen
Hi @ycli1995, no problem. I really appreciate the work you put into this. Do you mind if I take over development, then?
Hi @immanuelazn ,
That would be perfect! I’d be more than happy for you to take over the development. Really glad to see BPCells moving forward, and thanks for your willingness to continue working on it.
Best, Yuchen