msprime
Experimental yaml input format
This is an experiment to see what a yaml/json input format (building on demes) would look like. It mostly works I think, except for the basic confusion about the direction of time. We can easily imagine adding to this to allow for things like recombination maps.
Here's an example input file:
demography:
  # This is an **embedded** Demes yaml model.
  time_units: generations
  demes:
    - name: X
      epochs: [{end_time: 1000, start_size: 2000}]
    - name: A
      ancestors: [X]
      epochs: [{start_size: 2000}]
    - name: B
      ancestors: [X]
      epochs: [{start_size: 2000}]
# Note: We are **referring** to the Demes model here.
samples: {A: 100, B: 100}
sequence_length: 100000
recombination_rate: 1e-8
ploidy: 1
model: hudson
The idea is that we embed the Demes yaml description within the larger simulation configuration context. When we're parsing the input yaml, we just hand off the parsing of the demography object to demes-python, which will do all the hard work for us.
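The hand-off described above might look roughly like the following sketch. It assumes the config has already been loaded into a plain dict (for example from yaml or json); `split_config` and `simulate` are hypothetical helper names, not part of msprime or this PR. `demes.Graph.fromdict` and `msprime.Demography.from_demes` are existing calls in demes-python and msprime.

```python
def split_config(config):
    """Separate the embedded Demes model from the simulator-specific options."""
    options = dict(config)  # shallow copy so the caller's dict is untouched
    demography = options.pop("demography")
    return demography, options


def simulate(config):
    """Hypothetical driver: demes-python parses the model, msprime gets the rest."""
    # Third-party imports are kept local so split_config works without them.
    import demes
    import msprime

    demes_dict, options = split_config(config)
    graph = demes.Graph.fromdict(demes_dict)  # all Demes validation happens here
    demography = msprime.Demography.from_demes(graph)
    # Assumption: every remaining key is a valid msprime.sim_ancestry argument.
    return msprime.sim_ancestry(demography=demography, **options)


# The example input file above, written as the dict a yaml/json loader would give.
config = {
    "demography": {
        "time_units": "generations",
        "demes": [
            {"name": "X", "epochs": [{"end_time": 1000, "start_size": 2000}]},
            {"name": "A", "ancestors": ["X"], "epochs": [{"start_size": 2000}]},
            {"name": "B", "ancestors": ["X"], "epochs": [{"start_size": 2000}]},
        ],
    },
    "samples": {"A": 100, "B": 100},
    "sequence_length": 100000,
    "recombination_rate": 1e-8,
    "ploidy": 1,
    "model": "hudson",
}

demes_dict, options = split_config(config)
```

The simulator never inspects the contents of the `demography` block itself; it only refers to deme names (here `A` and `B`) from the surrounding keys, which is what keeps the Demes part self-contained.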
I'm not suggesting this as a general specification for popgen simulations, I just want to illustrate the power that we get from keeping Demes simple and self-contained. To me, the ability to make a simple configuration file for a specific simulator like this is a powerful argument for not over-specifying the standard. The more bells and whistles we add to the spec the less likely it is that it'll be compatible across different simulators.
Any thoughts @molpopgen @grahamgower @apragsdale? I've been talking about simulation configurations being able to "refer" to elements of the Demes model for a while, and this is an attempt to make things concrete. (I guess we shouldn't get into detailed discussions about Demes itself here though: if someone wants to follow up, maybe create an issue on the spec repo to discuss?)
Codecov Report
Merging #1842 (2f8956e) into main (6a9c603) will decrease coverage by 0.18%. The diff coverage is 50.98%.
@@ Coverage Diff @@
## main #1842 +/- ##
==========================================
- Coverage 90.46% 90.28% -0.19%
==========================================
Files 20 21 +1
Lines 10682 10733 +51
Branches 2167 2174 +7
==========================================
+ Hits 9664 9690 +26
- Misses 572 597 +25
Partials 446 446
Flag | Coverage Δ | |
---|---|---|
C | 90.28% <50.98%> (-0.19%) | :arrow_down: |
python | 96.89% <50.98%> (-0.63%) | :arrow_down: |
Impacted Files | Coverage Δ | |
---|---|---|
msprime/json_input.py | 45.16% <45.16%> (ø) | |
msprime/cli.py | 96.94% <52.94%> (-1.58%) | :arrow_down: |
msprime/mutations.py | 98.59% <100.00%> (+0.02%) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Two thoughts:
- it seems like this could be a nice bridge for the folks who aren't comfortable in python? It might be worth finding some of those people to test it out on.
- Perhaps this all should be within an ancestry: block, to be followed by a mutations: block (and then maybe an output: block?) for a more complete specification?
I agree with @petrelharp that it maybe needs to have separate ancestry: and mutations: blocks. But then it doesn't neatly align with the current CLI msp ancestry subcommand. Also, maybe the demography could be either inline or refer to a file path?
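One way the "inline or file path" idea could work is to dispatch on the type of the `demography` value, as in the sketch below. The helper names `demography_source` and `load_demography` are hypothetical, and treating every string as a path is an assumption of this sketch; `demes.load` and `demes.Graph.fromdict` are existing demes-python functions.

```python
def demography_source(value):
    """Classify a `demography` entry as a file path or an inline Demes model."""
    if isinstance(value, str):
        return "file"  # assumption: any string is a path to a Demes yaml file
    if isinstance(value, dict):
        return "inline"
    raise TypeError(
        f"demography must be a path or a mapping, not {type(value).__name__}"
    )


def load_demography(value):
    """Return a demes.Graph from either form of the `demography` entry."""
    import demes  # third-party; imported lazily so the classifier stands alone

    if demography_source(value) == "file":
        return demes.load(value)
    return demes.Graph.fromdict(value)
```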
Thanks, great points @petrelharp and @grahamgower! I think a combined ancestry and mutation format is the right approach, and yes, this would be a good bridge for people who aren't comfortable with Python.
WRT the CLI, I've already created an msp ancestry-yaml command as a quick way of getting something working without having to worry about the semantics of msp ancestry. So, we just need a command to run a simulation from a yaml config. Unfortunately msp simulate is already used as the legacy interface. We could do msp yaml?
Update: I've added the proposed mutations/ancestry sections and the config looks like this now:
ancestry:
  sequence_length: 100000
  recombination_rate: 1e-8
  samples: {A: 100, B: 100}
  ploidy: 1
  model: hudson
  demography:
    time_units: generations
    demes:
      - name: X
        epochs: [{end_time: 1000, start_size: 2000}]
      - name: A
        ancestors: [X]
        epochs: [{start_size: 2000}]
      - name: B
        ancestors: [X]
        epochs: [{start_size: 2000}]
mutations:
  rate: 1e-8
  model: blosum62
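As a rough illustration of how the two blocks could map onto msprime calls, here is a minimal sketch. The function names `plan_simulation` and `run` are hypothetical and are not the code in this PR; the sketch accepts the `demography` block either nested under `ancestry` or at the top level, since either layout is plausible. `msprime.sim_ancestry` and `msprime.sim_mutations` are the real entry points, and `"blosum62"` is one of the model names `sim_mutations` accepts as a string.

```python
def plan_simulation(config):
    """Split a two-block config into (demes_dict, ancestry_kwargs, mutation_kwargs)."""
    ancestry = dict(config["ancestry"])
    mutations = dict(config.get("mutations", {}))
    # Accept the Demes model inside the ancestry block or at the top level.
    demes_dict = ancestry.pop("demography", None)
    if demes_dict is None:
        demes_dict = config.get("demography")
    return demes_dict, ancestry, mutations


def run(config):
    """Hypothetical driver: ancestry first, then mutations on the result."""
    import demes
    import msprime

    demes_dict, ancestry, mutations = plan_simulation(config)
    if demes_dict is not None:
        graph = demes.Graph.fromdict(demes_dict)
        ancestry["demography"] = msprime.Demography.from_demes(graph)
    ts = msprime.sim_ancestry(**ancestry)
    if mutations:  # the mutations block is optional in this sketch
        ts = msprime.sim_mutations(ts, **mutations)
    return ts


example = {
    "ancestry": {
        "sequence_length": 100000,
        "samples": {"A": 100, "B": 100},
        "demography": {"time_units": "generations", "demes": []},
    },
    "mutations": {"rate": 1e-8, "model": "blosum62"},
}
demes_dict, ancestry_kwargs, mutation_kwargs = plan_simulation(example)
```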
To make this fully general we'd need to:
- Add support for reading RateMaps from dictionaries (easy)
- Support parsing Ancestry and Mutation models from dictionaries (should be pretty easy, this is basically what we turn the classes into anyway). Since the ancestry models use a duration, we actually sidestep the awkward time business.
- Think properly about time and implement start_time and end_time accordingly (but these are pretty niche options, so they could just be dropped)
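The first two items could be sketched along these lines. The dictionary schema (a `name` key plus constructor keyword arguments) and all helper names are assumptions made for illustration, not an existing msprime format. The classes referenced in `msprime_registry` and the `position`/`rate` arguments of `msprime.RateMap` are real.

```python
def model_from_dict(spec, registry):
    """Build one model from a name or from {"name": ..., **constructor_kwargs}."""
    if isinstance(spec, str):
        spec = {"name": spec}
    kwargs = dict(spec)  # copy so the parsed config is left unchanged
    name = kwargs.pop("name")
    if name not in registry:
        raise ValueError(f"unknown model: {name!r}")
    return registry[name](**kwargs)


def models_from_config(value, registry):
    """Accept a single spec or a list of specs, e.g. dtwf for a duration, then hudson."""
    specs = value if isinstance(value, list) else [value]
    return [model_from_dict(spec, registry) for spec in specs]


def msprime_registry():
    """Hypothetical name-to-class table; only two ancestry models are listed."""
    import msprime

    return {
        "dtwf": msprime.DiscreteTimeWrightFisher,
        "hudson": msprime.StandardCoalescent,
    }


def rate_map_from_dict(spec):
    """Build a RateMap from {"position": [...], "rate": [...]}."""
    import msprime

    return msprime.RateMap(position=spec["position"], rate=spec["rate"])
```

Because each model only carries a `duration`, a list such as `[{"name": "dtwf", "duration": 100}, "hudson"]` describes a model switch without any reference to absolute times, which is the point made in the second item.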
This looks really nice to me. Agree that ancestry/mutations/output blocks make a lot of sense, and those updates look clean. If I'm reading the changes correctly, you can place any valid argument to sim_ancestry and sim_mutations into this yaml? So specify seeds, or more complicated models (e.g. dtwf then switch to hudson), etc. For an "output" block, it might be nice to be able to specify "trees" vs "vcf", plus all the bells and whistles that go with those. Not sure how general you intend this input approach to be.
Overall, I think this would be a nice middle ground between Python scripting and the CLI (which can be confusing for some). Looking forward to discussing more today in a bit.
I like the approach overall. I think embedding the demes bits is quite elegant.