Mutational load function (SHM)
Added a `mutational_load` function to calculate the differences between the sequence and germline alignment. This is especially useful/insightful for BCRs due to SHM and helps to understand how much mutation actually occurred. However, this is a rather simple approach!
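The core idea can be sketched roughly like this (a minimal, hypothetical illustration, not the actual implementation in this PR; the handling of gap and ambiguous characters shown here is an assumption):

```python
def count_mutations(sequence_alignment: str, germline_alignment: str) -> tuple[int, float]:
    """Return (mutation_count, mutation_freq) for one IMGT-aligned pair.

    Hypothetical sketch: positions containing a gap ('.', '-') or an
    ambiguous base ('N') in either string are skipped, so indels and
    uncalled bases are not counted as somatic hypermutations.
    """
    if len(sequence_alignment) != len(germline_alignment):
        raise ValueError("sequence and germline alignments must have equal length")
    skip = {".", "-", "N", "n"}
    considered = 0
    mutations = 0
    for s, g in zip(sequence_alignment, germline_alignment):
        if s in skip or g in skip:
            continue
        considered += 1
        if s.upper() != g.upper():
            mutations += 1
    freq = mutations / considered if considered else 0.0
    return mutations, freq
```

Because both strings come from the same IMGT numbering scheme, a simple position-by-position comparison is enough; no re-alignment is needed.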
Closes #...
- [ ] CHANGELOG.md updated
- [ ] Tests added (For bug fixes or new features)
- [ ] Tutorial updated (if necessary)
In terms of implementation, I think we're getting there :) I still need to try it out myself to check whether I like the overall workflow/interface when using this in a Jupyter notebook.
Hi Gregor, I worked on the test case for the `mutational_load` function and finished a sort of "beta" version. I would be very grateful if you could have a look at whether I'm going in the right direction here :) Additionally, I discovered some bugs while testing, which I quickly resolved, though maybe not elegantly, so please have a look at that as well.
For some reason, pushing these changes seems to have broken something with MuData, but I have no idea why or what I could possibly have done to cause this. The error message seems to be the same everywhere: `ImportError: cannot import name 'AlignedViewMixin' from 'anndata._core.aligned_mapping'`. Could you help me solve this?
Breaking mudata is not your fault. It was caused by an anndata release and should be fixed by now. Just rerun the tests :)
Codecov Report
Attention: Patch coverage is 87.50000% with 12 lines in your changes missing coverage. Please review.
Project coverage is 81.75%. Comparing base (08e0cc3) to head (4ad7d59). Report is 20 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/scirpy/tl/_mutational_load.py | 85.88% | 12 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #536      +/-   ##
==========================================
+ Coverage   81.43%   81.75%   +0.31%
==========================================
  Files          49       50       +1
  Lines        4213     4451     +238
==========================================
+ Hits         3431     3639     +208
- Misses        782      812      +30
```
:umbrella: View full report in Codecov by Sentry.
@MKanetscheider, two more considerations:

1. Can we somehow verify that the alignments are really using the IMGT reference? Could we at least check something like the length? If they were using a different reference, the results would be pretty nonsensical, wouldn't they?

2. Especially in the case of `region="subregion"`, we end up with a ton of columns in `adata.obs`. To me it seems that the mutational load is actually a chain-level attribute, so how about adding it to the awkward array instead? In that case we could get rid of the following arguments:
   - `chain_idx_key` (just calculate it for all chains)
   - `chains` (just calculate it for all chains)
   - `region` (just calculate it for all regions)
   - `frequency` (just calculate both)

   The function would then add the chain-level attributes `mutation_count`, `mutation_freq`, `{cdr,fwr}{1,2,3,4}_mutation_{count,freq}`, and `v_segment_mutation_{count,freq}`. Afterwards, they could be retrieved like any other chain attribute:

   `df = ir.get.airr_df(mdata, ["VDJ_1"], ["mutation_count", "mutation_freq"])`

LMK what you think!
> Can we somehow verify that the alignments are really using the IMGT reference? Could we at least check something like the length? If they were using a different reference, the results would be pretty nonsensical, wouldn't they?
Unfortunately yes, if the sequences are not IMGT-aligned, this whole function would either stop working or just return random nonsense. I think I actually do already check for length, because I count differences via Hamming distance, which raises a `ValueError` if germline and sequence alignments have different lengths. However, I would also like to have a better safety net, but I couldn't come up with anything else so far...
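For illustration, that length safety net might look like the sketch below (a hedged, hypothetical helper; the actual name and behavior in the PR may differ):

```python
def hamming_distance(seq_alignment: str, germline_alignment: str) -> int:
    """Count mismatching positions between two equal-length alignments.

    Raises ValueError when the alignments have different lengths, which
    doubles as a crude sanity check that both strings come from the same
    (IMGT-style) alignment scheme.
    """
    if len(seq_alignment) != len(germline_alignment):
        raise ValueError(
            f"Alignment lengths differ: {len(seq_alignment)} vs {len(germline_alignment)}"
        )
    return sum(s != g for s, g in zip(seq_alignment, germline_alignment))
```

The `ValueError` is what stops the function from silently producing nonsense when one of the two strings was produced with a different reference.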
> Especially in the case of `region="subregion"`, we end up with a ton of columns in `adata.obs`. To me it seems that the mutational load is actually a chain-level attribute, so how about adding it to the awkward array instead? In that case we could get rid of the following arguments:
>
> - `chain_idx_key` (just calculate it for all chains)
> - `chains` (just calculate it for all chains)
> - `region` (just calculate it for all regions)
> - `frequency` (just calculate both)
>
> The function would then add the chain-level attributes `mutation_count`, `mutation_freq`, `{cdr,fwr}{1,2,3,4}_mutation_{count,freq}`, and `v_segment_mutation_{count,freq}`. Afterwards, they could be retrieved like any other chain attribute: `df = ir.get.airr_df(mdata, ["VDJ_1"], ["mutation_count", "mutation_freq"])`
>
> LMK what you think!
I think this sounds rather amazing... as you are already well aware, this whole function is rather ugly and not that user-friendly at the moment, so I would really love to see it in a more compact format. But do you think we might run into performance problems if we always calculate everything with one function call? The function would have to calculate the Hamming distance of every sequence/germline alignment pair, which could be troublesome for the big datasets that you and Felix are trying to reach, right?
> I think I actually do already check for length, because I count differences via Hamming distance, which raises a `ValueError` if germline and sequence alignments have different lengths
I think that's good enough then. After all it's also clearly stated in the documentation.
> But do you think that we might run into any performance problems if we always calculate everything with one function call?
I don't think it would be an issue, because it will still be linear in the number of chains. The part Felix has been working on compares all sequences against all others, which is quadratic in the number of sequences and therefore a much harder problem.
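A back-of-the-envelope comparison makes the scaling argument concrete (hypothetical dataset size, just to illustrate linear vs. quadratic growth):

```python
# Per-chain mutational load: one sequence-vs-germline comparison per chain,
# so the work grows linearly with the number of chains.
# All-vs-all sequence distances: every pair of sequences is compared,
# so the work grows quadratically with the number of sequences.
n = 1_000_000  # hypothetical number of chains/sequences

linear_comparisons = n                    # mutational load
pairwise_comparisons = n * (n - 1) // 2   # all-vs-all distances

# At this size, the quadratic problem needs ~500,000x more comparisons.
ratio = pairwise_comparisons / linear_comparisons
```

This is why adding the germline comparison for *all* chains barely moves the needle compared to the pairwise-distance computation.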
> I think this sounds rather amazing... as you are already well aware, this whole function is rather ugly and not that user-friendly at the moment, so I would really love to see it in a more compact format. But do you think we might run into performance problems if we always calculate everything with one function call? The function would have to calculate the Hamming distance of every sequence/germline alignment pair, which could be troublesome for the big datasets that you and Felix are trying to reach, right?
I gave it a try in https://github.com/scverse/scirpy/pull/573. Speed-wise, it is about as fast as running all regions on VJ_1/VDJ_1 chains in your implementation, but including all chains. It could also be further optimized, but it's likely not a bottleneck... it still completes within a few minutes on 1M cells.
I'd still need to deal with a bunch of edge cases, but I think the approach is viable.
Closing in favor of #573. Thanks for all your input @MKanetscheider! The optimized version in #573 wouldn't have been possible without it!
Thanks for the nice words, but to stay realistic, my input was at best the basic idea, while my implementation was far from anything useful. Thank you so much for still extracting something useful for this amazing package out of it; hopefully it contributes to research in various labs!