msprime
msprime copied to clipboard
Add ``sim`` top level function
It would be handy to have a top-level function that does simple simulations of ancestry and mutations. How about we do something like
def sim(
samples,
*,
sequence_length=None,
recombination_rate=None,
mutation_rate=None,
demography=None,
discrete_genome=None,
random_seed=None
):
seeds = generate_seeds(random_seed)
ts = sim_ancestry(
samples, sequence_length=sequence_length,
recombination_rate=recombination_rate,
demography=demography,
discrete_genome=discrete_genome,
random_seed=seeds[0])
return sim_mutations(ts, rate=mutation_rate, random_seed=seeds[1], discrete_genome=discrete_genome)
We can add things like ancestry_model
and mutation_model
etc as time goes by, but this much would satisfy 99% of the use cases I think (which is to do quick, simple simulations).
If you want to do complex stuff like specify the initial_state
etc, then you need to use sim_ancestry
directly.
What do we think?
@hyanwong, you've been asking for this for a while - can you link to the issues involved please?
I'm not sure what the gain is here - if there were more shared arguments it would make sense. To me it feels like you're hiding the key concept of mutations being placed after simulation of ancestry to just lose one line of code?
Fair point. Just seems like a handy way of avoiding a bit of typing for common tasks, but maybe it would cause confusion.
I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry. But as Ben says, it's just one line of code...
I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry.
I think forcing the user to understand this point is actually beneficial. The separation of sim_ancestry()/sim_mutations() makes it more explicit to users that the simulated tree(s) are necessarily independent of the mutations.
Plus, the zen of Python says:
There should be one-- and preferably only one --obvious way to do it.
Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.
Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.
Sure. We can add it later if others prefer. My main reason is that I often use msprime for generating a test tree sequence, e.g.
msprime.simulate(10, mutation_rate=1, random_seed=1)
to mess about with. It's more of a hassle to do with 2 calls, and (more worryingly) if you want a deterministic result, it's easy to accidentally specify the random_seed
value for only one of the functions.
But I can continue to use .simulate()
for this. Normally when I'm doing it, I don't actually care what the mutation model is anyway, or where recombinations occur. So I'm happy with what others think best.
I think if you call it sim_ancestry_and_mutations
it partially negates @grahamgower 's point. But it makes it long-winded to type, which isn't so good for my quick-and-dirty-use-case example.
(@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)
(@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)
I think it was almost entirely discussed on Slack. It's mentioned in passing at https://github.com/tskit-dev/msprime/issues/1119#issuecomment-672878670 but I couldn't find other refs.
I'm in the middle of doing something that illustrates a use-case for a top-level simXXX
function, so thought I should post it here:
import itertools
import msprime
times = []
start = time.time()
for seed in itertools.count(1):
ts = msprime.sim_mutations(
msprime.sim_ancestry(1, sequence_length=100, population_size=1000, random_seed=seed),
rate=1e-5, random_seed=seed)
# rejection sampling: only accept those with a single mutation above the RH node
if ts.num_mutations == 1 and ts.mutation(0).node == 1:
times.append(ts.node(2).time)
if seed % 1000 == 0:
print(seed, len(times), "in", time.time()-start, "seconds")
if len(times) > 1e4:
break
You can see the repeat of random_seed is a bit annoying. But perhaps I should just use simulate()
here anyway, as the combining sim_ancestry
and sim_mutations
results in a loop that runs at about half the speed of a simple simulate
call (perhaps that's because the new version is also creating an individual each time).
Interesting. I wouldn't see much point in actually setting seeds here, though. Running replicates via num_replicates
might be a bit more efficient too, although this is such a tiny simulation that it's all going to be overheads anyway.
Yeah, thanks. It is a lot quicker using num_replicates. True about setting the seed, although it's nice to be able to make it replicable.
I think there's probably less reason for this if we have #1786?