
Aggregate rows in BPCells

Open NormalLomo opened this issue 11 months ago • 2 comments

I am currently working on a project that involves approximately 1.3 million cells, and I need to merge gene data in BPCells. I have been unable to find a function that aggregates rows by groups effectively. Currently, the only method available to me involves transposing the matrix and merging it to create a pseudobulk matrix. However, this method returns a dense matrix. Is there a more efficient way to handle this process?

NormalLomo, Jan 20 '25 04:01

Hi @NormalLomo, unfortunately due to the way data is held within BPCells objects, it is pretty much impossible to perform pseudobulks if the groups are rows. If you are curious, it is because we load the columns one at a time within each thread. If we wanted to do a grouping by row, we would have to load in every single column, and find the row one by one.

To achieve what you are looking to do, you are right in calling pseudobulk_matrix(). Specifically, this would probably be the workflow you are looking at:

mat <- transpose_storage_order(mat) # Switch the underlying storage of your data to row-major order.
pseudobulk_matrix(t(mat), ...)

With respect to your last point, we expected the use case to be specifically post-clustering operations that aggregate cells. In that case, we would expect a (n_features, n_groups) matrix, which would be multiple orders of magnitude smaller than a (n_features, n_cells) matrix. Additionally, grouping operations typically make nearly all values non-zero, so we did not see a reason to make it output a dgCMatrix.

If that is not why you need a grouping operation, could you tell us what your use case is? If it is something we have not considered and it has direct, common single-cell applications, we could prioritize returning disk-backed matrices or a dgCMatrix from pseudobulk_matrix().

immanuelazn, Jan 20 '25 05:01

Hi @NormalLomo, just chiming in with another option. If all you want to do is sum or average row (or column) groups, then this can be represented as a matrix multiply operation and still get a sparse matrix output. The cluster_membership_matrix() helper function is probably useful here. The code would look something like this to sum row groups: (it would need tweaking for averaging or aggregating by column)

mat # Your data matrix
row_groups # A vector with 1 entry per row in the matrix, specifying which group it belongs in (e.g. a factor)
group_mat <- BPCells::cluster_membership_matrix(row_groups)
# Ensure your matrix is in "col" order, which is most efficient for this type of row aggregation
if (storage_order(mat) == "row") {
    mat <- transpose_storage_order(mat)
}

# Doing the aggregation is just a matrix multiply
output <- t(group_mat) %*% mat
# You may want to save the aggregated output if you will re-use it multiple times
output <- write_matrix_dir(output, "./my/output_path")

This kind of solution is probably only useful when your group count is in the 50+ range, as for a smaller group count pseudobulk_matrix() is probably fine from a memory and speed perspective (I think only ~1GB of RAM for 50 groups with pseudobulk_matrix() on a dataset of your size)
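
To make the matrix-multiply idea concrete, here is a minimal sketch on toy data. It uses the Matrix package as a stand-in rather than BPCells objects, and the hand-built `group_mat` is an assumption about what a membership matrix looks like (one row per gene, one column per group), not the actual output of cluster_membership_matrix(). The gene, cell, and group names are made up for illustration. It also shows one possible tweak for averaging: scale each group's indicator column by 1 / group size before multiplying.

```r
library(Matrix)

# Toy data: 4 genes (rows) x 3 cells (columns), stored as a sparse matrix
mat <- sparseMatrix(
  i = c(1, 2, 3, 4, 4),
  j = c(1, 1, 2, 3, 1),
  x = c(1, 2, 3, 4, 5),
  dims = c(4, 3),
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Hypothetical grouping: one entry per row (gene) of mat
row_groups <- factor(c("A", "A", "B", "B"))

# Hand-built membership matrix (assumed layout): entry (g, k) is 1
# when gene g belongs to group k, and 0 otherwise
group_mat <- sparseMatrix(
  i = seq_along(row_groups),
  j = as.integer(row_groups),
  x = 1,
  dims = c(length(row_groups), nlevels(row_groups)),
  dimnames = list(rownames(mat), levels(row_groups))
)

# Sum rows within each group: (groups x genes) %*% (genes x cells)
group_sums <- t(group_mat) %*% mat

# Average instead: divide each group's indicator column by the group size
group_sizes <- as.numeric(table(row_groups))
group_means <- t(group_mat %*% Diagonal(x = 1 / group_sizes)) %*% mat
```

Both results stay sparse, and `group_sums` matches what base R's rowsum() gives on the dense version of `mat`, which is a handy sanity check before running the same pattern on a large matrix.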

Also, a slight adjustment to @immanuelazn's answer -- the transpose_storage_order() shouldn't be necessary when using pseudobulk_matrix(). For the solution I outlined above transpose_storage_order() is not strictly needed either, but can greatly improve performance in certain circumstances. I think there might have been a misunderstanding of whether you are adding together genes or adding together cells -- some of his answer is only relevant if you are aggregating cells, and even then an approach similar to the one I've outlined above can make it work.

Hopefully one of our answers helps address your question, but feel free to reply with some more details if our solutions don't help with what you're trying to do.

bnprks, Jan 20 '25 09:01