msprime icon indicating copy to clipboard operation
msprime copied to clipboard

GenicSelection model cannot complete a sweep with a neutral trajectory.

Open molpopgen opened this issue 2 years ago • 23 comments

The genic selection model is not giving the correct output when the starting frequency is high.

The code to reproduce the plot below is here, and the README discusses the dependencies.

The following plots show the ECDF for the number of segregating sites for:

  1. dicsoal (blue) with selection
  2. msprime (black) with selection
  3. msprime (red) without selection.

The upper left panel is a hard sweep, and the two simulations with selection agree. As the initial frequency increases, though, msprime "gets neutral" too quickly.

What is not shown here is that forward simulations agree with discoal for all 4 panels. I can provide figures if desired, but it looks like there's a bug in msprime.

compare

Edit for future readers/folks interested in the model:

The cause of this issue is that, at the end of a GenicSelection model, we are left with a mutation at frequency start_frequency. If that frequency is high, then we have not simulated the ancestry of the sample back in time to the origin of the mutation.
In other words, we need a mechanism to complete the history of the structured coalescent from start_frequency back to ~ 0. For sweeps from standing variation, this would be done using a trajectory for a neutral mutation.

molpopgen avatar Jul 12 '21 19:07 molpopgen

huh interesting... i wonder what i messed up here. i'll dig a bit.

andrewkern avatar Jul 12 '21 19:07 andrewkern

For the record, I caught this b/c using msprime instead of discoal made the accuracy of diploSHIC "quite poor" and I tracked it down to this diff in the number of segregating sites.

molpopgen avatar Jul 12 '21 19:07 molpopgen

so browsed this quickly-- i think the difference is that discoal is doing a full soft sweep (i.e. conditioning on a neutral trajectory from freq f_0 -> 1/2N) where as we don't yet have that for msprime, so msprime is going straight from f_0 back to a standard coalescent sim. does that make sense you reckon Kevin?

andrewkern avatar Jul 12 '21 21:07 andrewkern

Msprime must be doing something, as initial frequency has a predictable effect in the expected direction. But it does seem like msprime is exiting the sweep phase too early.

molpopgen avatar Jul 12 '21 21:07 molpopgen

yeah i think that's it. to get msprime to do something equivalent to the discoal -f option we need to add a neutral trajectory model or maybe just a soft sweep model. it's on my radar.

andrewkern avatar Jul 12 '21 21:07 andrewkern

So we need to be a bit careful here, and will want @jeromekelleher to weigh in on this. We have a 1.0 API that has a start_frequency in there, so fiddling with that could be API-breaking. Seems that we also don't want to add more classes here? It could be confusing to have a separate SoftSweep class as a model in addition to GenicSelection--what if SelectionWithDominance is added later? Would we want a separate SoftSweepWithDominance for consistency? It can kinda blow up on you.

To me, it seems that we would prefer features added to GenicSelection that allow the remaining history from x to 1/2N to be treated as neutral.

molpopgen avatar Jul 12 '21 22:07 molpopgen

This issue implies that the examples here are somewhat nonsensical given the current implementation: https://tskit.dev/msprime/docs/latest/ancestry.html#multiple-sweeps

molpopgen avatar Jul 12 '21 22:07 molpopgen

My take would be that we can be relatively relaxed about making changes the SweepGenicSelection class - it's new, I doubt many people are using it yet, and I would much rather it be clean and correct than be strict about a 1.0 contract. I don't understand the details, so I'm not really able to weigh in on how things should be designed. Very much open to ideas on how to do this, and what the correct approach is.

Another reasonable approach, if we need to redesign things, might be to make a new selection model class, that has room to grow into all the various things you mention above.

This issue implies that the examples here are somewhat nonsensical given the current implementation:

As I say - I don't really know what I'm doing here, and was just trying to make some examples to illustrate the APIs. I'd be delighted if someone could put in more/better examples.

jeromekelleher avatar Jul 13 '21 07:07 jeromekelleher

.

This issue implies that the examples here are somewhat nonsensical given the current implementation:

As I say - I don't really know what I'm doing here, and was just trying to make some examples to illustrate the APIs. I'd be delighted if someone could put in more/better examples.

I'll push a fix so that they are generating correct models.

molpopgen avatar Jul 13 '21 13:07 molpopgen

Kevin are you going to implement the neutral trajectory stuff?

andrewkern avatar Jul 13 '21 14:07 andrewkern

So we need to be a bit careful here, and will want @jeromekelleher to weigh in on this. We have a 1.0 API that has a start_frequency in there, so fiddling with that could be API-breaking.

Also-- we definitely want start_frequency to remain in there. that's how i planned to break sweeps into phases.

andrewkern avatar Jul 13 '21 14:07 andrewkern

Kevin are you going to implement the neutral trajectory stuff?

No. Someone who's already worked on that code oath should do it.

molpopgen avatar Jul 13 '21 14:07 molpopgen

.

This issue implies that the examples here are somewhat nonsensical given the current implementation:

As I say - I don't really know what I'm doing here, and was just trying to make some examples to illustrate the APIs. I'd be delighted if someone could put in more/better examples.

I'll push a fix so that they are generating correct models.

By this, I mean fix the examples to be hard sweeps. Sorry for any confusion.

molpopgen avatar Jul 13 '21 14:07 molpopgen

Ah okay. I have a branch going with some of the neutral traj code. I'll clean that up and start getting a PR together.

andrewkern avatar Jul 13 '21 14:07 andrewkern

I'll also open up another issue where we can discuss semantics for what could be a more general class for these sorts of models.

molpopgen avatar Jul 13 '21 14:07 molpopgen

I edited the title/body to reflect the actual cause, so that we can remember this in the future.

molpopgen avatar Jul 13 '21 18:07 molpopgen

Wondering how feasible it is to just let users submit their own frequency trajectories? My group would definitely make use of this function, were it available. We are currently going back to mssel for this.

jjberg2 avatar Jul 15 '22 21:07 jjberg2

@jjberg2 -- that is an interesting idea. There's a lot to do before we get there, I think. Right now, the back-end is nowhere near as general as what discoal, etc., support.

molpopgen avatar Jul 15 '22 22:07 molpopgen

We'd love to do this @jjberg2, but there's noone to work on it at present and it's not a trivial undertaking unfortunately. See #1780 for more thoughts.

jeromekelleher avatar Jul 18 '22 10:07 jeromekelleher

Got it, thanks. I think we'd be very interested in having this functionality, and I know a few other groups who have used mssel within the past few years, who I'm sure would also benefit from having this option implemented in msprime. Suppose I had a student who was interested in working on this...do you think it could work as a PhD chapter? Or is there more/less work than that.

jjberg2 avatar Jul 27 '22 21:07 jjberg2

@jjberg2 -- it may be useful to talk "live" about this. We can go over what is needed, etc..

molpopgen avatar Jul 28 '22 19:07 molpopgen

do you think it could work as a PhD chapter? Or is there more/less work than that.

I'd say not great PhD chapter material @jjberg2-- a bunch of low level work to do, and the outcome (a new feature in msprime) isn't really publishable on its own I reckon.

andrewkern avatar Jul 28 '22 20:07 andrewkern

As Andy says @jjberg2, it would be hard to publish on this work. I'm happy to have a discussion about it though - please do ping me via email if you'd like to set up a meeting.

jeromekelleher avatar Jul 30 '22 11:07 jeromekelleher