msprime icon indicating copy to clipboard operation
msprime copied to clipboard

Drift trajectory implementation

Open andrewkern opened this issue 4 years ago • 21 comments

this PR is dealing with issues #1762 and #1780. this implements a NeutralFixation model, which deals with a structured coalescent model akin to a selective sweep but for the case where the allele that we condition upon is neutral rather than beneficial.

This will need a rebase, etc, but wanted to get some feedback on what i've done thus far

andrewkern avatar Aug 06 '21 22:08 andrewkern

Codecov Report

Merging #1803 (ea5fc6e) into main (63d1d25) will decrease coverage by 0.00%. The diff coverage is 91.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1803      +/-   ##
==========================================
- Coverage   91.05%   91.04%   -0.01%     
==========================================
  Files          20       20              
  Lines       10493    10523      +30     
  Branches     2219     2225       +6     
==========================================
+ Hits         9554     9581      +27     
  Misses        489      489              
- Partials      450      453       +3     
Flag Coverage Δ
C 91.04% <91.17%> (-0.01%) :arrow_down:
python 97.53% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
msprime/__init__.py 100.00% <ø> (ø)
msprime/_msprimemodule.c 89.58% <66.66%> (-0.15%) :arrow_down:
lib/msprime.c 85.74% <100.00%> (+0.02%) :arrow_up:
msprime/ancestry.py 98.75% <100.00%> (+0.03%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 63d1d25...ea5fc6e. Read the comment docs.

codecov[bot] avatar Aug 06 '21 22:08 codecov[bot]

Will take a look next week--probably Tuesday.

molpopgen avatar Aug 07 '21 16:08 molpopgen

kevin note this is just the neutral fixation case, an incremental step towards getting the soft sweep model in

andrewkern avatar Aug 07 '21 19:08 andrewkern

also i should probably add verification against SLiM for patterns of linked polymorphism to this PR. Right now in verification.py i've only got tests of the sojourn time of the neutral variants against the analytic expectation (which checks out) and i'm relying on the rest of the structured coalescent machinery to just work. it should though.

andrewkern avatar Aug 07 '21 19:08 andrewkern

Yes, we don't want anything slipping through the cracks, so more verification is a good idea.

On Sat, Aug 7, 2021, 12:14 PM Andrew Kern @.***> wrote:

also i should probably add verification against SLiM for patterns of linked polymorphism to this PR. Right now in verification.py i've only got tests of the sojourn time of the neutral variants against the analytic expectation (which checks out) and i'm relying on the rest of the structured coalescent machinery to just work. it should though.

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/msprime/pull/1803#issuecomment-894695693, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQ6OH4RFEZZHQUBDD7SWZ3T3WAZPANCNFSM5BWVZY6A .

molpopgen avatar Aug 07 '21 19:08 molpopgen

Yeah I agree- this is too much of the same code repeated. One thing that I was having a hard time wrapping my head around is how to generalize this sort of model and deal with the _to_lowlevel() stuff. in particularly i'm not sure the best way to deal with the parameters that are associated, e.g., this ugly thing that i've done here: https://github.com/andrewkern/msprime/blob/c7de424a686d4bba11a93e010d345b0c5cd57cfc/msprime/ancestry.py#L1969

any advice would be appreciated here Jerome.

andrewkern avatar Aug 10 '21 14:08 andrewkern

It looks like the only thing that's actually different is the trajectory generator, so what I'd like to do is to decouple the allele frequency trajectory generator from the Sweep model. Would it make sense to have something like


class TrajectoryGenerator:
    start_frequency:float
    end_frequency: float
    dt: float

class SweepModel:
     position: float
     trajectory: TrajectoryGenerator

We could then stuff like

model = SweepModel(position=0.5, trajectory=GenicSelectionTrajectory(0.1, 0.9, 1e-6))
# or 
model = SweepModel(position=0.5, trajectory=NuetralFixationTrajectory(0.1, 0.9, 1e-6))

We could keep support for the existing SweepGenicSelection class easily enough by using these building blocks.

These TrajectoryGenerator classes would front-end a bunch of stuff from the CPython layer, and would eventually end up referring to the corresponding struct + function pointer in C (a bit like ancestry models or demographic events). This would mean that we could also test the trajectory generation code independently of the actual simulations (a big issue at the moment), and also do things like allow people define trajectory generators in Python which would be super slow, but handy for prototyping (or maybe not - we're just generating numpy arrays, some models might be possible to implement efficiently using numpy).

Would that work? Are there more types of sweep that would fit into this framework, i.e., where we just define the trajectory generator and don't touch the actual ancestry simulation code at all?

If so, then maybe the best thing to do would be to back out of all the Python level changes and just commit in the nuetral trajectory in C (and maybe any others you have lying around handy) and then I'll jump in and do the plumbing?

jeromekelleher avatar Aug 10 '21 18:08 jeromekelleher

so i think it's a bit easier than this? all of the trajectories that we currently want (hard sweeps, soft sweeps, these neutral fixations) can be generated by the current C function genic_selection_generate_trajectory() (which itself should be renamed to something more general).

currently it is doing neutral fixations by asking about the selection coefficient, so what i did was tack on the python NeutralFixation model to basically do everything that the current SweepGenicSelection model was doing, but fix the selection coefficient at s=0. if we added one more parameter, call it f_0, which told the C trajectory function to switch between neutrality and selection, then we could add a SoftSweep model.

So maybe at the python level we have something like this?

class SweepModel:
     position: float
     start_frequency:float
     end_frequency: float
     dt: float

class HardSweepModel(SweepModel):
     s: float

class SoftSweepModel(SweepModel):
     s: float
     f_0: float

etc.?

andrewkern avatar Aug 10 '21 18:08 andrewkern

other use cases that I can think of that we would want to cover would be:

  1. hard partial sweeps (hard sweep that ends at freq < 1; already can do this)
  2. soft partial sweeps
  3. neutral fixations (already implemented)
  4. long term balanced polymorphism (no trajectory there, just constant deme sizes among backgrounds)

andrewkern avatar Aug 10 '21 18:08 andrewkern

Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulatation), then I'd rather specify it this way rather than trying to hide this detail.

@molpopgen, any thoughts here?

jeromekelleher avatar Aug 11 '21 05:08 jeromekelleher

so the biggest issue i see in separating the trajectory generation from the model entirely is that eventually (a few PRs down the road) we will want to get the trajectory generation connected to changing population size e.g. https://github.com/tskit-dev/msprime/blob/c7de424a686d4bba11a93e010d345b0c5cd57cfc/lib/msprime.c#L7049

i was thinking we'd do this like pop_size = get_population_size(sim->populations[0], time); so that would would need the simulation object during trajectory generation.

andrewkern avatar Aug 11 '21 14:08 andrewkern

I'm imagining having something like this:

        pop_size = get_population_size(&simulator->populations[0], sim_time);
        x = trajectory.transition_func(&trajectory, x, pop_size,  gsl_rng_uniform(rng));

Rather than:

       pop_size = get_population_size(&simulator->populations[0], sim_time);
       alpha = 2 * pop_size * trajectory.s;
        if (alpha > 0) {
            x = 1.0
                - genic_selection_stochastic_forwards(
                      trajectory.dt, 1.0 - x, alpha, gsl_rng_uniform(rng));
        } else {
            x = neutral_stochastic_backwards(trajectory.dt, x, gsl_rng_uniform(rng));
        }

So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.

Would that work?

jeromekelleher avatar Aug 11 '21 14:08 jeromekelleher

Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulatation), then I'd rather specify it this way rather than trying to hide this detail.

@molpopgen, any thoughts here?

For the case of a single trajectory, restricted to no migration between demes, this is the way to go.

molpopgen avatar Aug 11 '21 15:08 molpopgen

Would that work?

I don't see any issue with this, but it'd take a good validation test to make sure.

EDIT: had forgotten that the not storing in RAM was already happening here.

molpopgen avatar Aug 11 '21 15:08 molpopgen

So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.

Would that work?

Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?

@molpopgen we are storing these trajectories in RAM currently, and will continue to need to for the rejection sampling that we will eventually add to deal with population size change

andrewkern avatar Aug 11 '21 16:08 andrewkern

Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?

Sure yeah - I just want to make it possible to specify one of these in Python.

For the case of a single trajectory, restricted to no migration between demes, this is the way to go.

So looking ahead, is there a more complex scenario we'd want to model? A trajectory going through multiple demes that would need multiple population sizes? Or would that be better modelled with a specific class?

jeromekelleher avatar Aug 11 '21 18:08 jeromekelleher

i've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent. the weird bit is specifying the how / when of the beneficial allele reaching a 2nd (or 3rd etc) population in the context of migration between demes.

we can however do everything else in a multipop setting e.g., migration of neutral alleles, etc

andrewkern avatar Aug 11 '21 19:08 andrewkern

've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent.

Doing this right requires either some diffusion results that I've never seen (fixation pathways conditional on a specific demography) or to explicitly get trajectories via forward sims, rejection sampling until you get what you need. It is tricky.

molpopgen avatar Aug 11 '21 19:08 molpopgen

resurrected this cause of recent attention on slack, #1955, #1962

should we try to pull this functionality in finally?

andrewkern avatar Dec 15 '21 19:12 andrewkern

The thing that slowed this down, I think, is #1780. There's some idea that we need to get the right API sorted out "at once". Since we're now at 1.x, and fiddling we do risks breaking the expectations of semantic versioning. So it behooves us to get this right the first time.

molpopgen avatar Dec 21 '21 18:12 molpopgen

After spending a fair bit of time on this a few months ago trying to see what the "right" API would look like, I've come to the conclusion that this is a hard problem and realistically it's something that we'd need someone working on full time for a couple of months to really nail down. None of us have this kind of time on our hands, so I think we need to decide whether we want to wait and get it right, or patch things up as best we can for now while keeping in mind that we'll probably deprecate the current API some day, replacing with something more general (when hopefully we'll have someone to work on it).

I'd vote for patching things up as best we can for now, without adding in too much extra code complexity or duplication.

So, I guess the first thing to do is to rebase this branch and bring it up to date, and maybe resolve any issues that have been sorted out? Or perhaps it would be simpler to make a fresh PR so that we can do clean reviews?

jeromekelleher avatar Jan 05 '22 15:01 jeromekelleher