msprime
msprime copied to clipboard
GenicSelection model cannot complete a sweep with a neutral trajectory.
The genic selection model is not giving the correct output when the starting frequency is high.
The code to reproduce the plot below is here, and the README discusses the dependencies.
The following plots show the ECDF for the number of segregating sites for:
- dicsoal (blue) with selection
- msprime (black) with selection
- msprime (red) without selection.
The upper left panel is a hard sweep, and the two simulations with selection agree. As the initial frequency increases, though, msprime "gets neutral" too quickly.
What is not shown here is that forward simulations agree with discoal for all 4 panels. I can provide figures if desired, but it looks like there's a bug in msprime.
Edit for future readers/folks interested in the model:
The cause of this issue is that, at the end of a GenicSelection model, we are left with a mutation at frequency start_frequency
.
If that frequency is high, then we have not simulated the ancestry of the sample back in time to the origin of the mutation.
In other words, we need a mechanism to complete the history of the structured coalescent from start_frequency
back to ~ 0
.
For sweeps from standing variation, this would be done using a trajectory for a neutral mutation.
huh interesting... i wonder what i messed up here. i'll dig a bit.
For the record, I caught this b/c using msprime instead of discoal made the accuracy of diploSHIC "quite poor" and I tracked it down to this diff in the number of segregating sites.
so browsed this quickly-- i think the difference is that discoal is doing a full soft sweep (i.e. conditioning on a neutral trajectory from freq f_0 -> 1/2N) where as we don't yet have that for msprime, so msprime is going straight from f_0 back to a standard coalescent sim. does that make sense you reckon Kevin?
Msprime must be doing something, as initial frequency has a predictable effect in the expected direction. But it does seem like msprime is exiting the sweep phase too early.
yeah i think that's it. to get msprime to do something equivalent to the discoal -f
option we need to add a neutral trajectory model or maybe just a soft sweep model. it's on my radar.
So we need to be a bit careful here, and will want @jeromekelleher to weigh in on this. We have a 1.0 API that has a start_frequency
in there, so fiddling with that could be API-breaking. Seems that we also don't want to add more classes here? It could be confusing to have a separate SoftSweep class as a model in addition to GenicSelection--what if SelectionWithDominance is added later? Would we want a separate SoftSweepWithDominance for consistency? It can kinda blow up on you.
To me, it seems that we would prefer features added to GenicSelection that allow the remaining history from x to 1/2N to be treated as neutral.
This issue implies that the examples here are somewhat nonsensical given the current implementation: https://tskit.dev/msprime/docs/latest/ancestry.html#multiple-sweeps
My take would be that we can be relatively relaxed about making changes the SweepGenicSelection class - it's new, I doubt many people are using it yet, and I would much rather it be clean and correct than be strict about a 1.0 contract. I don't understand the details, so I'm not really able to weigh in on how things should be designed. Very much open to ideas on how to do this, and what the correct approach is.
Another reasonable approach, if we need to redesign things, might be to make a new selection model class, that has room to grow into all the various things you mention above.
This issue implies that the examples here are somewhat nonsensical given the current implementation:
As I say - I don't really know what I'm doing here, and was just trying to make some examples to illustrate the APIs. I'd be delighted if someone could put in more/better examples.
.
This issue implies that the examples here are somewhat nonsensical given the current implementation:
As I say - I don't really know what I'm doing here, and was just trying to make some examples to illustrate the APIs. I'd be delighted if someone could put in more/better examples.
I'll push a fix so that they are generating correct models.
Kevin are you going to implement the neutral trajectory stuff?
So we need to be a bit careful here, and will want @jeromekelleher to weigh in on this. We have a 1.0 API that has a
start_frequency
in there, so fiddling with that could be API-breaking.
Also-- we definitely want start_frequency to remain in there. that's how i planned to break sweeps into phases.
Kevin are you going to implement the neutral trajectory stuff?
No. Someone who's already worked on that code oath should do it.
.
This issue implies that the examples here are somewhat nonsensical given the current implementation:
As I say - I don't really know what I'm doing here, and was just trying to make some examples to illustrate the APIs. I'd be delighted if someone could put in more/better examples.
I'll push a fix so that they are generating correct models.
By this, I mean fix the examples to be hard sweeps. Sorry for any confusion.
Ah okay. I have a branch going with some of the neutral traj code. I'll clean that up and start getting a PR together.
I'll also open up another issue where we can discuss semantics for what could be a more general class for these sorts of models.
I edited the title/body to reflect the actual cause, so that we can remember this in the future.
Wondering how feasible it is to just let users submit their own frequency trajectories? My group would definitely make use of this function, were it available. We are currently going back to mssel for this.
@jjberg2 -- that is an interesting idea. There's a lot to do before we get there, I think. Right now, the back-end is nowhere near as general as what discoal, etc., support.
We'd love to do this @jjberg2, but there's noone to work on it at present and it's not a trivial undertaking unfortunately. See #1780 for more thoughts.
Got it, thanks. I think we'd be very interested in having this functionality, and I know a few other groups who have used mssel within the past few years, who I'm sure would also benefit from having this option implemented in msprime. Suppose I had a student who was interested in working on this...do you think it could work as a PhD chapter? Or is there more/less work than that.
@jjberg2 -- it may be useful to talk "live" about this. We can go over what is needed, etc..
do you think it could work as a PhD chapter? Or is there more/less work than that.
I'd say not great PhD chapter material @jjberg2-- a bunch of low level work to do, and the outcome (a new feature in msprime
) isn't really publishable on its own I reckon.
As Andy says @jjberg2, it would be hard to publish on this work. I'm happy to have a discussion about it though - please do ping me via email if you'd like to set up a meeting.