msprime icon indicating copy to clipboard operation
msprime copied to clipboard

Add ``sim`` top level function

Open jeromekelleher opened this issue 3 years ago • 12 comments

It would be handy to have a top-level function that does simple simulations of ancestry and mutations. How about we do something like

def sim(
    samples, 
    *,  
    sequence_length=None, 
    recombination_rate=None, 
    mutation_rate=None, 
    demography=None,  
    discrete_genome=None, 
    random_seed=None
):
    seeds = generate_seeds(random_seed)
    ts = sim_ancestry(
        samples, sequence_length=sequence_length, 
        recombination_rate=recombination_rate, 
        demography=demography, 
        discrete_genome=discrete_genome,
        random_seed=seeds[0])
    return sim_mutations(ts, rate=mutation_rate, random_seed=seeds[1], discrete_genome=discrete_genome)

We can add things like ancestry_model and mutation_model etc as time goes by, but this much would satisfy 99% of the use cases I think (which is to do quick, simple simulations).

If you want to do complex stuff like specify the initial_state etc, then you need to use sim_ancestry directly.

What do we think?

@hyanwong, you've been asking for this for a while - can you link to the issues involved please?

jeromekelleher avatar Mar 16 '21 10:03 jeromekelleher

I'm not sure what the gain is here - if there were more shared arguments it would make sense. To me it feels like you're hiding the key concept of mutations being placed after simulation of ancestry to just lose one line of code?

benjeffery avatar Mar 16 '21 11:03 benjeffery

Fair point. Just seems like a handy way of avoiding a bit of typing for common tasks, but maybe it would cause confusion.

jeromekelleher avatar Mar 16 '21 12:03 jeromekelleher

I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry. But as Ben says, it's just one line of code...

petrelharp avatar Mar 16 '21 12:03 petrelharp

I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry.

I think forcing the user to understand this point is actually beneficial. The separation of sim_ancestry()/sim_mutations() makes it more explicit to users that the simulated tree(s) are necessarily independent of the mutations.

Plus, the zen of Python says:

There should be one-- and preferably only one --obvious way to do it.

grahamgower avatar Mar 16 '21 13:03 grahamgower

Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.

jeromekelleher avatar Mar 16 '21 13:03 jeromekelleher

Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.

Sure. We can add it later if others prefer. My main reason is that I often use msprime for generating a test tree sequence, e.g. msprime.simulate(10, mutation_rate=1, random_seed=1) to mess about with. It's more of a hassle to do with 2 calls, and (more worryingly) if you want a deterministic result, it's easy to accidentally specify the random_seed value for only one of the functions.

But I can continue to use .simulate() for this. Normally when I'm doing it, I don't actually care what the mutation model is anyway, or where recombinations occur. So I'm happy with what others think best.

I think if you call it sim_ancestry_and_mutations it partially negates @grahamgower 's point. But it makes it long-winded to type, which isn't so good for my quick-and-dirty-use-case example.

hyanwong avatar Mar 16 '21 14:03 hyanwong

(@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)

jeromekelleher avatar Mar 16 '21 16:03 jeromekelleher

(@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)

I think it was almost entirely discussed on Slack. It's mentioned in passing at https://github.com/tskit-dev/msprime/issues/1119#issuecomment-672878670 but I couldn't find other refs.

hyanwong avatar Mar 16 '21 17:03 hyanwong

I'm in the middle of doing something that illustrates a use-case for a top-level simXXX function, so thought I should post it here:

import itertools
import msprime
times = []
start = time.time()
for seed in itertools.count(1):
    ts = msprime.sim_mutations(
        msprime.sim_ancestry(1, sequence_length=100, population_size=1000, random_seed=seed),
        rate=1e-5, random_seed=seed)
    # rejection sampling: only accept those with a single mutation above the RH node
    if ts.num_mutations == 1 and ts.mutation(0).node == 1:
        times.append(ts.node(2).time)
    if seed % 1000 == 0:
        print(seed, len(times), "in", time.time()-start, "seconds")
    if len(times) > 1e4:
        break

You can see the repeat of random_seed is a bit annoying. But perhaps I should just use simulate() here anyway, as the combining sim_ancestry and sim_mutations results in a loop that runs at about half the speed of a simple simulate call (perhaps that's because the new version is also creating an individual each time).

hyanwong avatar Mar 24 '21 15:03 hyanwong

Interesting. I wouldn't see much point in actually setting seeds here, though. Running replicates via num_replicates might be a bit more efficient too, although this is such a tiny simulation that it's all going to be overheads anyway.

jeromekelleher avatar Mar 24 '21 16:03 jeromekelleher

Yeah, thanks. It is a lot quicker using num_replicates. True about setting the seed, although it's nice to be able to make it replicable.

hyanwong avatar Mar 24 '21 17:03 hyanwong

I think there's probably less reason for this if we have #1786?

jeromekelleher avatar Jul 28 '21 07:07 jeromekelleher