Potentially contributing to the Python API
Dear Author,
Thanks for your contribution of this amazing algorithm and package. For my own use, I have modified the Python API and related C++ code a bit, mainly to: 1. save the library size (`n_fragments`) and optionally the `group_name` (barcode / pseudobulk ID / etc.) together with the precalculated matrix; 2. add a new wrapper `MultiPrecalculatedInsertionMatrix` to load multiple `PrecalculatedInsertionMatrix` objects with a glob (useful for handling multiple samples). The changes are in this fork.
However, I'm not sure about your requirements for code contributions, and I'm sure my code has some style differences from yours 😂. Based on my understanding, the Python and R code share the same C++ backend. I don't want to cause breaking changes to the R part, so for now I'll just leave this note here 😂 If you are interested in incorporating these changes, we can figure out how to move forward.
Sincerely, Xi
Hi @fuxialexander, thanks for reaching out! I've taken a quick skim through the changes you've made in your fork. It's very impressive that you were able to jump in and make C++ and Python changes independently and we're definitely happy to take contributions.
Overall, our current status for taking Python contributions is:
- Everything is in an "experimental" status, meaning we are planning to do a proper overhaul over the next few months to make the Python functionality much more complete and closer to the R functionality.
- In the meantime, we're happy to add to the experimental Python functionality, with the understanding that things will likely get rearranged at some point in the future.
- All the C++ and Python changes you've made look like they are totally fair game to change, as it's all Python-specific code (only the `bpcells-cpp` folder is shared).
In short, I think we'll be able to take a lot of what you've written as a PR. Give me a few days to read and play with your code in more detail, as I suspect there are a few spots where I'll have some style suggestions to make it work a bit more consistently with the rest of BPCells.
-Ben
Sounds great, and thanks! Most of the credit should go to Cursor/LLM though lol, I'm mostly a test runner 🤪
Hi @fuxialexander, sorry for the delay getting back to you. I've put my notes below. It's mostly a lot of interface-type requests to make sure that what you add matches well with the rest of the library. I hope my suggestions are clear, but if you have any disagreements, questions or comments just let me know. If you're able to address most of the suggestions, please open a PR so we can do a more final code review and merge in your changes!
-Ben
- For adding group names, I think a simpler approach would be:
  - Store group names as the row names of the output matrix, rather than creating a new `group_names.json` file. (The `RenameDims` class might be useful for setting row names.)
  - To avoid needing to add the `group_names` parameters, tweak `build_cell_groups` to return a `pd.Categorical`, and update the `cell_groups` parameter of `pseudobulk_insertion_counts` to take an argument of type `Union[Sequence[int], pd.Categorical]` (taking group names from the category names if the passed type is a `pd.Categorical`). A rough sketch of this follows below.
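For illustration, here is a minimal sketch of how the `Union[Sequence[int], pd.Categorical]` dispatch could look. The helper name `_resolve_cell_groups` is hypothetical (not an existing BPCells function); it only shows how integer codes and group names could be derived from either input type.

```python
import pandas as pd
from typing import List, Sequence, Tuple, Union


def _resolve_cell_groups(
    cell_groups: Union[Sequence[int], pd.Categorical]
) -> Tuple[Sequence[int], List[str]]:
    """Hypothetical helper: return (integer codes, group names) from either input type."""
    if isinstance(cell_groups, pd.Categorical):
        # Group names come from the category labels; codes are already integers.
        codes = cell_groups.codes
        names = [str(c) for c in cell_groups.categories]
    else:
        # Plain integer group IDs: names default to the stringified IDs.
        codes = list(cell_groups)
        names = [str(i) for i in sorted(set(codes))]
    return codes, names


# Example:
groups = pd.Categorical(["B", "T", "B", "NK"])
codes, names = _resolve_cell_groups(groups)
# codes -> [0, 2, 0, 1] (as an int8 array), names -> ['B', 'NK', 'T']
```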
- For `MultiPrecalculatedInsertionMatrix`, I'm a bit worried about having to maintain two copies of the interface, one for `PrecalculatedInsertionMatrix` and one for the `Multi` variant.
  - I think it might be simpler to just replace the implementation of `PrecalculatedInsertionMatrix` with most of what you've done for `Multi`, changing the constructor so that `path` can be either a single `str` or a `Sequence[str]`. That way we just need one class to maintain. (A sketch of the constructor is below.)
  - Just confirming -- you have a specific desire for the `collapse_group` parameter? It seems like a handy option, though if you don't need it yourself I'd slightly prefer to leave it out and keep things simpler until someone needs `collapse_group`.
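As a rough illustration of the single-class idea (the attribute names and class body below are placeholders, not the actual BPCells implementation), the constructor could normalize `path` up front so the rest of the code only ever sees a list of paths:

```python
import glob
from pathlib import Path
from typing import Sequence, Union


class PrecalculatedInsertionMatrix:
    """Skeleton only: shows the str-or-sequence normalization, not the real loading logic."""

    def __init__(self, path: Union[str, Path, Sequence[Union[str, Path]]]):
        # Accept a single directory or a sequence of directories and normalize
        # to a list, so downstream code has a single code path.
        if isinstance(path, (str, Path)):
            self._paths = [Path(path)]
        else:
            self._paths = [Path(p) for p in path]
        # ... open one precalculated matrix per path and concatenate them ...


# Callers that want glob behavior can expand the pattern themselves:
# mat = PrecalculatedInsertionMatrix(sorted(glob.glob("samples/*/insertions")))
```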
- For the `library_sizes` functionality, two small notes:
  - When calculating library_sizes in C++, I think it can be done much more simply by calling `rowSums()` on the `full_mat` variable. The row sum will happen in parallel if the `ConcatCols` object is constructed with the `threads` parameter set >1.
  - For the format of the `"library_size"` file, it should use either an already existing BPCells binary format or a text-based format (e.g. JSON). If you expect the number of groups to always be below 1K, then JSON is probably fine, but if you want to be really safe for scalability you could use a binary file. For the binary file, you'd want to construct a `FileNumWriter` or `FileNumReader` object in C++ to utilize the existing binary storage formats from BPCells. (A small JSON example is sketched below.)
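To make the JSON option concrete, here is a minimal sketch of what a text-based library-size file could look like from the Python side. The file name, key layout, and helper functions are illustrative assumptions, not an existing BPCells format:

```python
import json
from pathlib import Path
from typing import Dict, Sequence


def write_library_sizes(out_dir: Path, group_names: Sequence[str], sizes: Sequence[int]) -> None:
    """Illustrative helper: persist per-group library sizes as a small JSON file."""
    payload = {name: int(n) for name, n in zip(group_names, sizes)}
    (out_dir / "library_size.json").write_text(json.dumps(payload, indent=2))


def read_library_sizes(out_dir: Path) -> Dict[str, int]:
    """Illustrative helper: load per-group library sizes back into a dict."""
    return json.loads((out_dir / "library_size.json").read_text())
```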
Thanks for your review! Will try to propose a PR this week~
Hi Ben,
I've made some edits accordingly; please see if this works:
https://github.com/bnprks/BPCells/commit/fb67c5d2dd42bbfee5073abf8aea8b479681a80c
Sincerely, Xi
Hi Xi, thanks for the new edits! I'll try to look them over within the next week.