Potentially contributing to the Python API
Dear Author,
Thanks for your contribution of this amazing algorithm and package. For my own use, I have modified the Python API and related C++ code a bit, mainly to: 1. save the library size (`n_fragments`) and optionally the `group_name` (barcode / pseudobulk ID / etc.) together with the precalculated matrix; 2. add a new wrapper `MultiPrecalculatedInsertionMatrix` to load multiple `PrecalculatedInsertionMatrix` objects with a glob (useful for handling multiple samples). The changes are in this fork.
However, I'm not sure about your requirements for code contributions, and I'm sure my code has some style differences from yours 😂. Based on my understanding, the Python and R code share the same C++ backend. I don't want to cause breaking changes to the R part, so for now I'll just leave this note here 😂 If you are interested in incorporating these changes, we can figure out how to move forward.
Sincerely, Xi
Hi @fuxialexander, thanks for reaching out! I've taken a quick skim through the changes you've made in your fork. It's very impressive that you were able to jump in and make C++ and Python changes independently and we're definitely happy to take contributions.
Overall, our current status for taking Python contributions is:
- Everything is in an "experimental" status, meaning we are planning to do a proper overhaul over the next few months to make the Python functionality much more complete and closer to the R functionality.
- In the meantime, we're happy to add to the experimental Python functionality, with the understanding that things will likely get rearranged at some point in the future.
- All the C++ and Python changes you've made look like they are totally fair game to change, as it's all Python-specific code (only the `bpcells-cpp` folder is shared).
In short, I think we'll be able to take a lot of what you've written as a PR. Give me a few days to read and play with your code in more detail, as I suspect there are a few spots where I'll have some style suggestions to make it work a bit more consistently with the rest of BPCells.
-Ben
Sounds great, and thanks! Most of the credit should go to Cursor/LLM though lol, I'm mostly a test runner 🤪
Hi @fuxialexander, sorry for the delay getting back to you. I've put my notes below. It's mostly a lot of interface-type requests to make sure that what you add matches well with the rest of the library. I hope my suggestions are clear, but if you have any disagreements, questions or comments just let me know. If you're able to address most of the suggestions, please open a PR so we can do a more final code review and merge in your changes!
-Ben
- For adding group names, I think a simpler approach would be:
  - Store group names as the row names of the output matrix, rather than creating a new `group_names.json` file. (The `RenameDims` class might be useful for setting row names.)
  - To avoid needing to add the `group_names` parameters, tweak `build_cell_groups` to return a `pd.Categorical`, and update the `cell_groups` parameter of `pseudobulk_insertion_counts` to take an argument of type `Union[Sequence[int], pd.Categorical]` (taking group names from the category names if the passed type is a `pd.Categorical`). A rough sketch of this follows below.
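For illustration, here is a minimal sketch of how the `Union[Sequence[int], pd.Categorical]` dispatch could look. The helper name `_resolve_cell_groups` is hypothetical (not an existing BPCells function); it only shows how integer codes and group names could be derived from either input type.

```python
import pandas as pd
from typing import List, Sequence, Tuple, Union


def _resolve_cell_groups(
    cell_groups: Union[Sequence[int], pd.Categorical]
) -> Tuple[Sequence[int], List[str]]:
    """Hypothetical helper: return (integer codes, group names) from either input type."""
    if isinstance(cell_groups, pd.Categorical):
        # Group names come from the category labels; codes are already integers.
        codes = cell_groups.codes
        names = [str(c) for c in cell_groups.categories]
    else:
        # Plain integer group IDs: names default to the stringified IDs.
        codes = list(cell_groups)
        names = [str(i) for i in sorted(set(codes))]
    return codes, names


# Example:
groups = pd.Categorical(["B", "T", "B", "NK"])
codes, names = _resolve_cell_groups(groups)
# codes -> [0, 2, 0, 1] (as an int8 array), names -> ['B', 'NK', 'T']
```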
- For `MultiPrecalculatedInsertionMatrix`, I'm a bit worried about having to maintain two copies of the interface, one for `PrecalculatedInsertionMatrix` and one for the `Multi` variant.
  - I think it might be simpler to just replace the implementation of `PrecalculatedInsertionMatrix` with most of what you've done for `Multi`, changing the constructor so that `path` can be either a single `str` or a `Sequence[str]`. That way we just need one class to maintain. (A sketch of the constructor is below.)
  - Just confirming -- you have a specific desire for the `collapse_group` parameter? It seems like a handy option, though if you don't need it yourself I'd slightly prefer to leave it out and keep things simpler until someone needs `collapse_group`.
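As a rough illustration of the single-class idea (the attribute names and class body below are placeholders, not the actual BPCells implementation), the constructor could normalize `path` up front so the rest of the code only ever sees a list of paths:

```python
import glob
from pathlib import Path
from typing import Sequence, Union


class PrecalculatedInsertionMatrix:
    """Skeleton only: shows the str-or-sequence normalization, not the real loading logic."""

    def __init__(self, path: Union[str, Path, Sequence[Union[str, Path]]]):
        # Accept a single directory or a sequence of directories and normalize
        # to a list, so downstream code has a single code path.
        if isinstance(path, (str, Path)):
            self._paths = [Path(path)]
        else:
            self._paths = [Path(p) for p in path]
        # ... open one precalculated matrix per path and concatenate them ...


# Callers that want glob behavior can expand the pattern themselves:
# mat = PrecalculatedInsertionMatrix(sorted(glob.glob("samples/*/insertions")))
```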
- For the `library_sizes` functionality, two small notes:
  - When calculating library_sizes in C++, I think it can be done much more simply by calling `rowSums()` on the `full_mat` variable. The row sum will happen in parallel if the `ConcatCols` object is constructed with the `threads` parameter set >1.
  - For the format of the `"library_size"` file, it should use either an already existing BPCells binary format or a text-based format (e.g. JSON). If you expect the number of groups to always be below 1K, then JSON is probably fine, but if you want to be really safe for scalability you could use a binary file. For the binary file, you'd want to construct a `FileNumWriter` or `FileNumReader` object in C++ to utilize the existing binary storage formats from BPCells. (A small JSON example is sketched below.)
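To make the JSON option concrete, here is a minimal sketch of what a text-based library-size file could look like from the Python side. The file name, key layout, and helper functions are illustrative assumptions, not an existing BPCells format:

```python
import json
from pathlib import Path
from typing import Dict, Sequence


def write_library_sizes(out_dir: Path, group_names: Sequence[str], sizes: Sequence[int]) -> None:
    """Illustrative helper: persist per-group library sizes as a small JSON file."""
    payload = {name: int(n) for name, n in zip(group_names, sizes)}
    (out_dir / "library_size.json").write_text(json.dumps(payload, indent=2))


def read_library_sizes(out_dir: Path) -> Dict[str, int]:
    """Illustrative helper: load per-group library sizes back into a dict."""
    return json.loads((out_dir / "library_size.json").read_text())
```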
Thanks for your review! Will try to propose a PR this week~
Hi Ben,
I've made some edits accordingly; please see if this works:
https://github.com/bnprks/BPCells/commit/fb67c5d2dd42bbfee5073abf8aea8b479681a80c
Sincerely, Xi
Hi Xi, thanks for the new edits! I'll try to look them over within the next week.