msprime
msprime copied to clipboard
Class design for "selective sweep" concept.
This issue is motivated by #1762
The current class for modeling the structured coalescent is msprime.SweepGenicSelection
. This class handles the following scenario:
- Given a start/end frequency (e.g., moving backwards in time so that end < start), we generate a trajectory using the conditional diffusion methods described in Coop and Griffiths (2004). The trajectory of a mutation with fitnesses 1, 1+s/2 and 1+s for the 3 diploid genotypes, conditional on it moving from frequency
start
toend
. - That trajectory is associated with a genomic position.
- Given that trajectory, we do the operations described here to generate the history of the two parts of the population--lineages linked to the site and those that have recombined away from it (or weren't linked in the first place if start frequency is low enough).
The init
arguments that matter most are start/end frequencies and s
.
Moving forwards, there are many more modeling possibilities:
- Handling arbitrary dominance, which amounts to a new trajectory-making function.
- Adding an option that guarantees that the model end at a frequency 1/(2N). (Not guaranteeing this is the cause of #1762.) We want this to be optional so that one could chain together multiple sweep models to do "stranger" things like stitch together trajectories where
s
and/or dominance vary over time. By default, we probably want this feature to be on, reflecting the most common anticipated use case. - Allowing for population size changes, and possibly migration of the beneficial allele, thus getting closer to feature parity with discoal.
So, at the very least, we want the init
method to account for:
- Dominance, via an
h
kwarg
- Guaranteeing that the trajectory goes down to a very small frequency, a.k.a "1/2N".
We also need a name, bearing in mind the assumptions as I understand them:
- The Hudson model (as opposed to DWTF or others).
- The trajectory applies to a single site and a mutation that arose in a single copy at some point in the past. (In other words, we are not allowing for the other flavor of sweeps from standing variation by which the beneficial allele was present in multiple copies of different evolutionary origin at the onset of selection. Nor are we allowing recurrent mutation to the beneficial allele during the "selective phase" of the simulation.)
I've likely missed a few things.
cc @andrewkern @jeromekelleher
One thing I forgot: allowing start == end
along with some concept of "forever", allowing the strong balancing selection scenario of Kaplan, Darden, and Hudson?
this is great @molpopgen! i'm totally game to work on implementing these things.
while dealing with dominance is important eventually, i think the first thing to knock out is the selection from standing variation scenario to deal with the issue in #1762
all this should amount to is implementing drift/neutral trajectories which I have already done. I just need to finish up the testing and get a PR together.
another big priority for me is to get the population size changes done and incorporating the rejection sampling of trajectories that i've done in discoal.
For naming-- I'd suggest we set up models like "hard_sweep" , "recurrent_hard_sweep", "soft_sweep", etc. How does that strike you?
Sounds great to me @molpopgen, thanks a million for putting this together.
Naming is tricky because it is related to what's under the hood, at least a bit:
- We could have a single class,
msprime.HudsonStructuredCoalescentWithASingleTrajectory
. This is explicit, but there's a good chance of lots ofkwargs
, all possible subsets of which cause confusion. But the plus side is a single point of failure when it comes to checking input. - We probably want to avoid inheritance:
class msprime.HardSweep(msprime._HudsonStructuredBlahBlahBlah)
. That gets us into thesuper()
pickling mess and some other confusion. - So, we could have lots of individual "sweep model classes", with simple
init
functions, and each class holds an instance of the richer_HudsonStructuredThing
.
But all of the above is related to what the C would look like under all of this.
No matter what, we want the defaults to reflect common use cases, allowing you to do things like remake a figure from a 2002 paper in 1/2 screen of Python code.
No matter what, we want the defaults to reflect common use cases, allowing you to do things like remake a figure from a 2002 paper in 1/2 screen of Python code.
let's do it. want to review some code for me @molpopgen as I pull the bits together?
let's do it. want to review some code for me @molpopgen as I pull the bits together?
Yeah, definitely. I'm quite interested, because I had a near-complete diploSHIC workflow in place using this, until #1762 got in the way!
diploSHIC training is gonna fly once this is done.
diploSHIC training is gonna fly once this is done.
Yeah, in my little test work flows, it was about 30% faster. There's considerable file I/O + time spend in allel
that dominates still.
would be great to side step that all using tskit stats....
would be great to side step that all using tskit stats....
Yes, the challenge there is that many of the stats are haplotype-y, and many such stats have an ad-hoc flavor to them. The whole set of D-like
stats can be don using the existing code, but for many of the others, there's some thinking that needs doing.