seqlike icon indicating copy to clipboard operation
seqlike copied to clipboard

Mutation class #57

Open ericmjl opened this issue 2 years ago • 2 comments

This PR is ready for review.

In this PR, we add in Mutation and MutationSet classes. The intent is to represent mutations made to a SeqLike. I've taken inspiration from multiple places, but the biggest one has been the discussion on #57.

The use cases for this Mutation class are primarily two-fold:

  1. Adding them to a SeqLike returns another mutated SeqLike.
  2. Subtracting (diffing) a SeqLike from another yields a MutationSet.

While working through use case 2, I noticed how it's a bit tricky:

  • adding a Mutation or MutationSet to a SeqLike results in a new SeqLike,
  • but subtracting the new SeqLike from the original SeqLike might not necessarily result in the original MutationSet,
  • yet adding the new MutationSet to the original SeqLike will give back a SeqLike identical to the new SeqLike.

An example of this phenomena is shown in a new notebook, docs/notebooks/mutations.ipynb. For reviewing purposes, that is probably the go-to notebook for understanding what Mutation and MutationSet classes can do; the rest are implementation details.

Would love to get feedback on this PR -- especially if there are semantics that I haven't yet thought of.


TODO list of what's left:

  • [x] FIX TEST: SeqLike classes have two more class methods. Therefore, 77 is correct. Reference here.
  • [x] Add .positions class method, which returns a list of positions to mutate.
  • [x] Switch out magical_parse() for __new__() under Mutation.

ericmjl avatar Sep 26 '22 16:09 ericmjl

I am going to do a few more commits to address @andrewgiessel's remaining comments.

ericmjl avatar Oct 05 '22 13:10 ericmjl

@ericmjl , this new mutation functionality is awesome, and the jupyter notebook is very helpful in exploring it! The notation and "arithmetic" conventions are simple, clear and sensible.

The following behavior doesn't look right, however:

s1 = aaSeqLike("MKAILV")
ms2 = MutationSet(mutations="^2A;^4D;5-".split(";"))
str(s1 + ms2)

yields

MAAD-V

whereas I'd expect

MAKD-ILV

Perhaps this is related, but the following also gives an unexpected result:

s1 - (s1 + Mutation("5-"))

yields

[5V, 6-]

instead of

[5-]

ndousis avatar Oct 05 '22 21:10 ndousis

@ndousis thanks for the feedback!

The behaviour you've outlined is one that I've been thinking about w.r.t. semantics. The essence is that there is an incrementing of position numbers every time that we have an insertion, so that any insertions are preserved relative to the original numbering of the sequence. In the example you showed, on the latest commit ffdb0bc, I get the behaviour defined by the aforementioned semantics:

image

Do you think it'll help to jump on a call with you to see if the current semantics make sense? If so, could you email me at my work address (intentionally not posting here to avoid spam)? I'll send you a calendly link that we can use to schedule.

ericmjl avatar Nov 04 '22 20:11 ericmjl

@ndousis and @andrewgiessel this PR is ready for more reviews. I also hope that we can get official "approvals" on the PR before merging, so that we leave a record.

ericmjl avatar Nov 04 '22 21:11 ericmjl

@ericmjl , I think the current semantics make sense, and they do yield stable solutions. I was troubled by the asymmetry of cases like s1 - (s1 + Mutation("5-")), but as you've pointed out in a previous version of mutations.ipynb, consistent annotation of indels is difficult without pairwise alignment, which in itself is inconsistent.

Mutation and MutationSet are very thoughtful and useful new features in SeqLike. Approved!

ndousis avatar Dec 07 '22 15:12 ndousis

Thank you, @andrewgiessel and @ndousis! I am going to merge and cut a new release once the CI passes!

ericmjl avatar Dec 08 '22 22:12 ericmjl