msprime icon indicating copy to clipboard operation
msprime copied to clipboard

Add adjust_for_silent option to sim_mutations

Open gtsambos opened this issue 2 years ago • 6 comments

@petrelharp and I had an extended convo in Slack this week about silent mutations and the complications they introduce for users. I have some strong opinions about changes we could make so that future users don't trip over them, but since this would be a breaking change, I thought I'd post parts of our convo here to get feedback from others.

The tl;dr summary is: I think we should change the output of sim_mutations to match what users probably think they are simulating when they specify a given mutation_rate. In my head, here are the steps that this would involve:

  • the mutation rate adjustment that is currently hidden away here should be performed by default inside sim_mutations.
  • Removing silent mutations from the all the methods that operate on/iterate over mutations, like ts.variants() for instance.
  • If the user actually does wants the output to include silent mutations, the current default, this could be specified with some extra argument to sim_mutations, something like include_silent=True

gtsambos avatar May 29 '22 05:05 gtsambos

Here are some relevant bits from our Slack thread.

Me:

they can be a sizeable fraction of the mutations you generate under certain mutation models, even if you are only applying them once. This is part of what’s tricky I think — even if you don’t fall into a use case where the silent mutations are needed, they may still have consequences for the variation in your simulated datasets, sometimes quite a lot

Peter:

But - they should have zero implications for any analysis? Do you have an example of an analysis where they give you different numbers? (other than "let's count number of segregating sites by number of rows in the site table", which isn't right anyways?)

Me:

I guess I'm worried about situations where people don't realise that some of the generated mutations are silent, and then conduct analyses with misleadingly small numbers of variant sites. (Or -- situations where they do realise but don't know what to do about them/assume there's been a mistake, and drop their msprime work in favour of something else they feel they understand better)

If these mutation rate adjustments should always be done to obtain correct distributions, it would be ideal to have some way of doing them under the hood within sim_mutations itself. (Even more ideal -- to automatically prune these silent mutations from the output)

I do think this is a significant usability issue. The only reason why I realised that these mutations existed was because I was manually printing out lots of output on smaller tree sequences. I'm very glad I didn't update my ABC scripts to use msprime 1.0, as I would have definitely missed this and it would have made the kind of subtle but meaningful statistical impact that would have been very hard to diagnose/detect

Also, about 'zero implications for analysis' -- the issue is not that silent mutations do unwanted things to our statistics, but that they are taking the place of 'real' mutations which do things we want and expect. For instance, in simulations with lots of silent mutations, you'll underestimate diversity-type statistics, and increase the variance of all site statistics (since there will be a sparser set of mutations underlying them)

It shouldn't be too hard to do the adjustment under the hood, right? All you need to do is to calculate what fraction of the transition matrix total is in its diagonal elements

Peter:

I don't think it's as simple as "they're taking the place of 'real' mutations" - this exact same thing would be happening even if we did not include the silent mutations in the output. Is what you're saying is that (a) if someone simulates mutations at rate μ under some non-parent-independent model, then what they mean by "at rate μ" is probably "so that we get an average of μ non-silent mutations per unit time per bp in the long run" and so (b) that's what we should make the rate argument to sim_mutations mean, if possible? That seems like a good idea, I'm just not sure if (a) is quite right (this is a question for the phylogenetics literature, since that's where they use other mutation models mostly), and also explaining (b) is a bit tricky (although we've already got to explain (b) more or less so...).

Right - so this is actually orthogonal to the silent mutations being present; their presence just makes the issue more obvious.

I'm totally with you about the vague sense of unease, but what really convinced me this is the way to go is that all these confusing things already happen without silent mutations. People (eg, me) can get very confused trying to compare ts.num_sites to ts.num_mutations and to the actual number of segregating sites - due to multiple hits, the number of sites and mutations aren't the same in an msprime simulation, and a SLiM simulation also has fixed mutations. I think that once we move beyond the infinite sites model these things are just part of life.

Me:

I'm completely onboard with them needing to be there to make the models work -- what I'm uneasy about is the degree to which we are involving the user. I think they'd ideally be a purely internal thing. Even quite experienced users like Gregor and I have had to do a lot of digging and asking to figure out what they do and don't matter for, and I imagine a lot of users would feel intimidated having to do anything vaguely linear algebra-ish to get their simulations running as expected. Even a 10% difference in the number of variants can be significant in simulation-based inference. (It would have been, in the applications i was working on).

Peter:

So - do you agree with my (a) & (b) summary above? If we think that's the right way to go then I'd be happy making that change (although it is backwards-incompatible!); I just want to make sure we've thought it through enough.

And I guess there's the other concern you're raising about having to dig to figure out what they do and don't matter for. Perhaps we should just make this more obvious somewhere - saying "Alert! There may be silent and/or fixed mutations, but these don't affect the following things..."

Me:

Yes to a and b!

gtsambos avatar May 29 '22 05:05 gtsambos

btw, would be very happy to help implement these changes -- just won't have time for at least the next month

gtsambos avatar May 29 '22 05:05 gtsambos

Thanks for the summary! As I see it there are two separate proposals here:

  1. the mutation rate adjustment that is currently hidden away here should be performed by default inside sim_mutations.

I'm in favor of doing this somehow. In other words, make it easy (and probably default?) to make it so that the mutation rate for gives you the mean density in the long run of non-silent mutations (as opposed to all mutations). Note that this is not possible for all mutation models.

  1. Removing silent mutations from the all the methods that operate on/iterate over mutations, like ts.variants() for instance.

I do not think we should do this. I agree that this is a source of confusion, but it's a giant can of worms and doesn't get rid of all the sources of confusion. (For instance, should we remove fixed mutations?) However, we might provide a ts.remove_silent_mutations() method to clear these out if people want.

On the topic of (1.) - when I thought about doing that initially, I thought (a) we could just multiply all the entries in the mutation matrix to achieve this... but then the mutation matrix entries are ugly and weird. Or, (b) we could provide a separate rate_scaling parameter to mutation models that multiplies everything... but then it's weird there's two ways to do the same thing. Lacking a good way to resolve this, I dropped it.

petrelharp avatar May 29 '22 16:05 petrelharp

This is a major change, and would require the introduction of a new API (since we can't silently change the output of existing code), so I'd be pretty reluctant to make Peter's (1) the default. We could add an option like adjust_for_silent=False which would do this maybe?

For (2) I agree with Peter: this would be a huge amount of work, and would probably lead to more confusion in the long run.

We spent a long time going around the pros and cons of silent mutations and the current approach was settled on as the best compromise we could come up with :shrug:. It's too late to change things for msprime 1.0 API now, and we do have to keep stability and support as very high priorities.

jeromekelleher avatar May 30 '22 10:05 jeromekelleher

Hmm, I understand the need for stability, and I apologise for not being on top of the conversation when this issue was initially discussed and designed. Perhaps for now, we could do this

We could add an option like adjust_for_silent=False which would do this maybe?

and update the documentation and tutorial material so that users understand this option. if there is an msprime 2.0 sometime in the distant future, we could reconsider these bigger changes then. With more time, we should get a better sense of whether msprime users are getting tripped up by this , and that can inform how much of a priority it should be.

gtsambos avatar May 31 '22 03:05 gtsambos

OK, I've changed the title of this issue to make it closable and added it to the next release milestone. If anyone wants to help out here with some implementation or testing please do shout out.

jeromekelleher avatar May 31 '22 08:05 jeromekelleher