msprime
msprime copied to clipboard
Drift trajectory implementation
this PR is dealing with issues #1762 and #1780. this implements a NeutralFixation model, which deals with a structured coalescent model akin to a selective sweep but for the case where the allele that we condition upon is neutral rather than beneficial.
This will need a rebase, etc, but wanted to get some feedback on what i've done thus far
Codecov Report
Merging #1803 (ea5fc6e) into main (63d1d25) will decrease coverage by
0.00%. The diff coverage is91.17%.
@@ Coverage Diff @@
## main #1803 +/- ##
==========================================
- Coverage 91.05% 91.04% -0.01%
==========================================
Files 20 20
Lines 10493 10523 +30
Branches 2219 2225 +6
==========================================
+ Hits 9554 9581 +27
Misses 489 489
- Partials 450 453 +3
| Flag | Coverage Δ | |
|---|---|---|
| C | 91.04% <91.17%> (-0.01%) |
:arrow_down: |
| python | 97.53% <100.00%> (+<0.01%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| msprime/__init__.py | 100.00% <ø> (ø) |
|
| msprime/_msprimemodule.c | 89.58% <66.66%> (-0.15%) |
:arrow_down: |
| lib/msprime.c | 85.74% <100.00%> (+0.02%) |
:arrow_up: |
| msprime/ancestry.py | 98.75% <100.00%> (+0.03%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 63d1d25...ea5fc6e. Read the comment docs.
Will take a look next week--probably Tuesday.
kevin note this is just the neutral fixation case, an incremental step towards getting the soft sweep model in
also i should probably add verification against SLiM for patterns of linked polymorphism to this PR. Right now in verification.py i've only got tests of the sojourn time of the neutral variants against the analytic expectation (which checks out) and i'm relying on the rest of the structured coalescent machinery to just work. it should though.
Yes, we don't want anything slipping through the cracks, so more verification is a good idea.
On Sat, Aug 7, 2021, 12:14 PM Andrew Kern @.***> wrote:
also i should probably add verification against SLiM for patterns of linked polymorphism to this PR. Right now in verification.py i've only got tests of the sojourn time of the neutral variants against the analytic expectation (which checks out) and i'm relying on the rest of the structured coalescent machinery to just work. it should though.
— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/msprime/pull/1803#issuecomment-894695693, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQ6OH4RFEZZHQUBDD7SWZ3T3WAZPANCNFSM5BWVZY6A .
Yeah I agree- this is too much of the same code repeated. One thing that I was having a hard time wrapping my head around is how to generalize this sort of model and deal with the _to_lowlevel() stuff. in particularly i'm not sure the best way to deal with the parameters that are associated, e.g., this ugly thing that i've done here:
https://github.com/andrewkern/msprime/blob/c7de424a686d4bba11a93e010d345b0c5cd57cfc/msprime/ancestry.py#L1969
any advice would be appreciated here Jerome.
It looks like the only thing that's actually different is the trajectory generator, so what I'd like to do is to decouple the allele frequency trajectory generator from the Sweep model. Would it make sense to have something like
class TrajectoryGenerator:
start_frequency:float
end_frequency: float
dt: float
class SweepModel:
position: float
trajectory: TrajectoryGenerator
We could then stuff like
model = SweepModel(position=0.5, trajectory=GenicSelectionTrajectory(0.1, 0.9, 1e-6))
# or
model = SweepModel(position=0.5, trajectory=NuetralFixationTrajectory(0.1, 0.9, 1e-6))
We could keep support for the existing SweepGenicSelection class easily enough by using these building blocks.
These TrajectoryGenerator classes would front-end a bunch of stuff from the CPython layer, and would eventually end up referring to the corresponding struct + function pointer in C (a bit like ancestry models or demographic events). This would mean that we could also test the trajectory generation code independently of the actual simulations (a big issue at the moment), and also do things like allow people define trajectory generators in Python which would be super slow, but handy for prototyping (or maybe not - we're just generating numpy arrays, some models might be possible to implement efficiently using numpy).
Would that work? Are there more types of sweep that would fit into this framework, i.e., where we just define the trajectory generator and don't touch the actual ancestry simulation code at all?
If so, then maybe the best thing to do would be to back out of all the Python level changes and just commit in the nuetral trajectory in C (and maybe any others you have lying around handy) and then I'll jump in and do the plumbing?
so i think it's a bit easier than this? all of the trajectories that we currently want (hard sweeps, soft sweeps, these neutral fixations) can be generated by the current C function genic_selection_generate_trajectory() (which itself should be renamed to something more general).
currently it is doing neutral fixations by asking about the selection coefficient, so what i did was tack on the python NeutralFixation model to basically do everything that the current SweepGenicSelection model was doing, but fix the selection coefficient at s=0. if we added one more parameter, call it f_0, which told the C trajectory function to switch between neutrality and selection, then we could add a SoftSweep model.
So maybe at the python level we have something like this?
class SweepModel:
position: float
start_frequency:float
end_frequency: float
dt: float
class HardSweepModel(SweepModel):
s: float
class SoftSweepModel(SweepModel):
s: float
f_0: float
etc.?
other use cases that I can think of that we would want to cover would be:
- hard partial sweeps (hard sweep that ends at freq < 1; already can do this)
- soft partial sweeps
- neutral fixations (already implemented)
- long term balanced polymorphism (no trajectory there, just constant deme sizes among backgrounds)
Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulatation), then I'd rather specify it this way rather than trying to hide this detail.
@molpopgen, any thoughts here?
so the biggest issue i see in separating the trajectory generation from the model entirely is that eventually (a few PRs down the road) we will want to get the trajectory generation connected to changing population size e.g. https://github.com/tskit-dev/msprime/blob/c7de424a686d4bba11a93e010d345b0c5cd57cfc/lib/msprime.c#L7049
i was thinking we'd do this like pop_size = get_population_size(sim->populations[0], time); so that would would need the simulation object during trajectory generation.
I'm imagining having something like this:
pop_size = get_population_size(&simulator->populations[0], sim_time);
x = trajectory.transition_func(&trajectory, x, pop_size, gsl_rng_uniform(rng));
Rather than:
pop_size = get_population_size(&simulator->populations[0], sim_time);
alpha = 2 * pop_size * trajectory.s;
if (alpha > 0) {
x = 1.0
- genic_selection_stochastic_forwards(
trajectory.dt, 1.0 - x, alpha, gsl_rng_uniform(rng));
} else {
x = neutral_stochastic_backwards(trajectory.dt, x, gsl_rng_uniform(rng));
}
So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.
Would that work?
Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulatation), then I'd rather specify it this way rather than trying to hide this detail.
@molpopgen, any thoughts here?
For the case of a single trajectory, restricted to no migration between demes, this is the way to go.
Would that work?
I don't see any issue with this, but it'd take a good validation test to make sure.
EDIT: had forgotten that the not storing in RAM was already happening here.
So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.
Would that work?
Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?
@molpopgen we are storing these trajectories in RAM currently, and will continue to need to for the rejection sampling that we will eventually add to deal with population size change
Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?
Sure yeah - I just want to make it possible to specify one of these in Python.
For the case of a single trajectory, restricted to no migration between demes, this is the way to go.
So looking ahead, is there a more complex scenario we'd want to model? A trajectory going through multiple demes that would need multiple population sizes? Or would that be better modelled with a specific class?
i've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent. the weird bit is specifying the how / when of the beneficial allele reaching a 2nd (or 3rd etc) population in the context of migration between demes.
we can however do everything else in a multipop setting e.g., migration of neutral alleles, etc
've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent.
Doing this right requires either some diffusion results that I've never seen (fixation pathways conditional on a specific demography) or to explicitly get trajectories via forward sims, rejection sampling until you get what you need. It is tricky.
resurrected this cause of recent attention on slack, #1955, #1962
should we try to pull this functionality in finally?
The thing that slowed this down, I think, is #1780. There's some idea that we need to get the right API sorted out "at once". Since we're now at 1.x, and fiddling we do risks breaking the expectations of semantic versioning. So it behooves us to get this right the first time.
After spending a fair bit of time on this a few months ago trying to see what the "right" API would look like, I've come to the conclusion that this is a hard problem and realistically it's something that we'd need someone working on full time for a couple of months to really nail down. None of us have this kind of time on our hands, so I think we need to decide whether we want to wait and get it right, or patch things up as best we can for now while keeping in mind that we'll probably deprecate the current API some day, replacing with something more general (when hopefully we'll have someone to work on it).
I'd vote for patching things up as best we can for now, without adding in too much extra code complexity or duplication.
So, I guess the first thing to do is to rebase this branch and bring it up to date, and maybe resolve any issues that have been sorted out? Or perhaps it would be simpler to make a fresh PR so that we can do clean reviews?