loupeR icon indicating copy to clipboard operation
loupeR copied to clipboard

feat: Support `count_mat` as BPCells matrix

Open ycli1995 opened this issue 7 months ago • 3 comments

This will enable users to use BPCells matrix to save RAM.

ycli1995 avatar May 09 '25 08:05 ycli1995

Can you give more of an explanation of what this PR is doing? Can users not use the BPCells library and then invoke the create_loupe function with their custom count matrix? Are you saying that we are saving memory because BPCells sparse matrices are more compact than dgCMatrix?

Just briefly looking at the PR, I don't like all of the duplication and I worry about depending on on a v0.3.0 of a library

esiegel avatar May 12 '25 17:05 esiegel

Sorry about that my initial PR was a bit minimal in terms of explanation. I should have provided more context on the motivation of this PR. Thanks for pointing it out!

The original create_loupe() ask for count_mat to be only dgCMatrix, which means users should load the whole matrix into R session. This could be high memory-cost when it comes to large scale data (>100k cells). With BPCells, however, users can create 10x matrix HDF5 without loading the whole matrix.

For example, to create .cloupe from a matrix of 200k cells in an h5ad file, one can first open the matrix with BPCells::open_matrix_anndata_hdf5():

count_mat <- BPCells::open_matrix_anndata_hdf5(path_to_h5ad)

This operation will not actually load the whole matrix. We can then create the .cloupe file with create_loupe():

create_loupe(count_mat, ...)

The actual logic is create_hdf5_BPCells(count_mat, h5path) -> louper_create_cloupe(h5path). In create_hdf5_BPCells(), data from input matrix are transfered to h5path using buffered stream and delayed evaluation, with the default buffer size as only 16384 bytes. Therefore it can save the memory cost at HDF5 writing.

Regarding your concerns:

  • Duplication: You are right —— I’ll try refactoring the code to reduce duplication. The current version was aiming for keeping the change clear from the original version, and for ensuring the new features behave exactly as dgCMatrix does in unit tests. But I'm happy to clean it up.
  • Dependency: BPCells installation will be checked when users call create_loupe() on matrix classes inheriting from IterableMatrix, and will raise an error if the package is not installed. This follows a similar pattern to how Seurat::Read10X_h5() checks for hdf5r.

Let me know if you'd be open to including this functionality — I'm happy to make further adjustments if you'd prefer a different direction. Also I completely understand if you’d prefer to avoid adding a dependency on BPCells at this stage. Thanks again for taking the time to review this — really appreciate your work on loupeR!

ycli1995 avatar May 13 '25 07:05 ycli1995

After thinking about it, I think adding BPCell support would be a great feature. Seems like it is popular enough - Seurat even has vignettes encouraging using it.

Maybe for cleaning up the code we can:

  • continue with a single exported create_hdf5 function. Now it will accept either the dgCMatrix or the IterableMatrix. It is here we also should verify that the user has the optional BPCells installed and fail fast.
  • Create the internal functions write_mat_dgcmatrix and write_mat_bpcells that takes a path where the H5 should be written and writes out only the matrix data (indptr, data, indices) and not the feature data.

Thanks again

esiegel avatar May 16 '25 17:05 esiegel

Thanks for the detailed reply — really appreciate the constructive feedback and your willingness to support this feature!

Your suggestions on simplifying the logic and separating out the internal write_mat_* functions make a lot of sense, and should make the code cleaner going forward.

One small thing to flag in advance: when using BPCells to output the 10x-style matrix HDF5, it always writes the barcodes and features groups in its own implementation. In details, BPCells writes the strings data with unlimited length while loupeR does not. Unfortunately, this format doesn't match what louper.exe expects, and causes errors when creating the .cloupe file.

My current workaround is to let BPCells write the matrix part, then manually delete and rewrite the barcodes and features groups using loupeR’s existing logic. It’s a bit of a hack, but avoids having to open an issue or PR on BPCells itself. This part may take me a bit of time to clean up. I’ll aim to revisit it as soon as I can. Totally agree it’s worth polishing up.

ycli1995 avatar May 17 '25 02:05 ycli1995

One small thing to flag in advance: when using BPCells to output the 10x-style matrix HDF5, it always writes the barcodes and features groups in its own implementation. In details, BPCells writes the strings data with unlimited length while loupeR does not. Unfortunately, this format doesn't match what louper.exe expects, and causes errors when creating the .cloupe file.

My current workaround is to let BPCells write the matrix part, then manually delete and rewrite the barcodes and features groups using loupeR’s existing logic. It’s a bit of a hack, but avoids having to open an issue or PR on BPCells itself. This part may take me a bit of time to clean up. I’ll aim to revisit it as soon as I can. Totally agree it’s worth polishing up.

Thank you for pointing out the shortcoming of louper.exe and the format that it expects strings. I think using unlink to remove the datasets from the H5 file is a fine solution for now. It will lead to the files being slightly larger as unlink doesn't actually remove the bytes, but I think this is fine. Longer term, the executable should be updated to support non-fixed length strings.

esiegel avatar May 19 '25 16:05 esiegel

Just checking in. Any update on this?

esiegel avatar Jun 11 '25 17:06 esiegel

Sorry for the delayed response. I finally get back from my full-time job. T^T

The current version has merged the logic of create_hdf5_BPCells to the original create_hdf5. To keep create_hdf5() simple, I only added a conditional branch when count_mat is BPCells matrix. The write_mat() function previously used for dgCMatrix was kept unchanged.

For the unit tests, I did not merge the "can create hdf5 with BPCells" to "can create hdf5". After the hdf5 file written, the tests codes are generally the same. Do you want me to extract that part of codes into a helper function such as validate_hdf5()?

ycli1995 avatar Jun 15 '25 15:06 ycli1995