Experimental yaml input format

Open jeromekelleher opened this issue 2 years ago • 7 comments

This is an experiment to see what a yaml/json input format (building on demes) would look like. It mostly works I think, except for the basic confusion about the direction of time. We can easily imagine adding to this to allow for things like recombination maps.

Here's an example input file:

demography:
  # This is an **embedded** Demes yaml model.
  time_units: generations
  demes:
    - name: X
      epochs: [{end_time: 1000, start_size: 2000}]
    - name: A
      ancestors: [X]
      epochs: [{start_size: 2000}]
    - name: B
      ancestors: [X]
      epochs: [{start_size: 2000}]

# Note: We are **referring** to the Demes model here.
samples: {A: 100, B: 100}
sequence_length: 100000
recombination_rate: 1e-8
ploidy: 1
model: hudson

The idea is that we embed the Demes yaml description within the larger simulation configuration context. When we're parsing the input yaml, we just hand-off the parsing of the demography object to demes-python which will do all the hard work for us.
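As a rough illustration of that hand-off (a sketch, not the code in this PR): the function names `split_config` and `simulate` are made up for this example, and the config is written as JSON so the runnable part needs only the standard library. The `simulate` function assumes the third-party `demes` and `msprime` packages and that every remaining key is a valid `sim_ancestry` keyword.

```python
import json

# Hypothetical config mirroring the example above, written as JSON so that
# only the standard library is needed to parse it.
CONFIG_TEXT = """
{
  "demography": {
    "time_units": "generations",
    "demes": [
      {"name": "X", "epochs": [{"end_time": 1000, "start_size": 2000}]},
      {"name": "A", "ancestors": ["X"], "epochs": [{"start_size": 2000}]},
      {"name": "B", "ancestors": ["X"], "epochs": [{"start_size": 2000}]}
    ]
  },
  "samples": {"A": 100, "B": 100},
  "sequence_length": 100000,
  "recombination_rate": 1e-8,
  "ploidy": 1,
  "model": "hudson"
}
"""


def split_config(config):
    """Separate the embedded Demes model from the simulator options."""
    options = dict(config)  # shallow copy so the caller's dict is untouched
    demography = options.pop("demography")
    return demography, options


def simulate(config):
    """Run the simulation (assumes demes and msprime are installed)."""
    import demes
    import msprime

    demography, options = split_config(config)
    # demes validates and resolves the embedded model for us.
    graph = demes.Graph.fromdict(demography)
    return msprime.sim_ancestry(
        demography=msprime.Demography.from_demes(graph), **options
    )


demography, options = split_config(json.loads(CONFIG_TEXT))
```

The simulator-specific keys (`samples`, `ploidy`, and so on) never reach Demes, which is the point: the Demes model stays self-contained and the rest of the file can be whatever a given simulator needs.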

I'm not suggesting this as a general specification for popgen simulations, I just want to illustrate the power that we get from keeping Demes simple and self-contained. To me, the ability to make a simple configuration file for a specific simulator like this is a powerful argument for not over-specifying the standard. The more bells and whistles we add to the spec the less likely it is that it'll be compatible across different simulators.

Any thoughts @molpopgen @grahamgower @apragsdale? I've been talking about simulation configurations being able to "refer" to elements of the Demes model for a while, and this is an attempt to make things concrete. (I guess we shouldn't get into detailed discussions about Demes itself here though: if someone wants to follow up, maybe create an issue on the spec repo to discuss?)

jeromekelleher avatar Sep 18 '21 15:09 jeromekelleher

Codecov Report

Merging #1842 (2f8956e) into main (6a9c603) will decrease coverage by 0.18%. The diff coverage is 50.98%.

@@            Coverage Diff             @@
##             main    #1842      +/-   ##
==========================================
- Coverage   90.46%   90.28%   -0.19%     
==========================================
  Files          20       21       +1     
  Lines       10682    10733      +51     
  Branches     2167     2174       +7     
==========================================
+ Hits         9664     9690      +26     
- Misses        572      597      +25     
  Partials      446      446              
Flag      Coverage Δ
C         90.28% <50.98%> (-0.19%)
python    96.89% <50.98%> (-0.63%)

Impacted Files           Coverage Δ
msprime/json_input.py    45.16% <45.16%> (ø)
msprime/cli.py           96.94% <52.94%> (-1.58%)
msprime/mutations.py     98.59% <100.00%> (+0.02%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update abe6116...2f8956e.

codecov[bot] avatar Sep 18 '21 15:09 codecov[bot]

Two thoughts:

  • it seems like this could be a nice bridge for the folks who aren't comfortable in python? It might be worth finding some of those people to test it out on.
  • Perhaps this all should be within an ancestry: block, to be followed by a mutations: block (and then maybe an output: block?) for a more complete specification?

petrelharp avatar Sep 19 '21 22:09 petrelharp

I agree with @petrelharp that it maybe needs to have separate ancestry: and mutations: blocks. But then it doesn't neatly align with the current CLI msp ancestry subcommand. Also, maybe the demography could be either inline or refer to a file path?
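One way the inline-or-path idea could look, as a sketch only: the helper name `resolve_demography` and the rule "a string means a file path, a mapping means an inline model" are assumptions for illustration, not anything implemented here.

```python
from pathlib import Path


def resolve_demography(value, base_dir="."):
    """Classify a `demography` entry as an inline model or a file reference."""
    if isinstance(value, dict):
        # Inline Demes model: would be passed to demes.Graph.fromdict.
        return "inline", value
    if isinstance(value, str):
        # Path relative to the config file: would be passed to demes.load.
        return "file", Path(base_dir) / value
    raise TypeError(
        f"demography must be a mapping or a path, got {type(value).__name__}"
    )
```

Resolving the path relative to the config file's directory (rather than the working directory) is a design choice that keeps a config and its Demes file portable as a pair.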

grahamgower avatar Sep 20 '21 07:09 grahamgower

Thanks, great points @petrelharp and @grahamgower ! I think a combined ancestry and mutation format is the right approach, and yes, this would be a good bridge for people who aren't comfortable with Python.

WRT the CLI, I've already created an msp ancestry-yaml subcommand as a quick way of getting something working without having to worry about the semantics of msp ancestry. So, we just need a command to run a simulation from a yaml config. Unfortunately msp simulate is already used as the legacy interface. We could do msp yaml?

jeromekelleher avatar Sep 20 '21 09:09 jeromekelleher

Update: I've added the proposed mutations/ancestry sections and the config looks like this now:

ancestry:
  sequence_length: 100000
  recombination_rate: 1e-8
  samples: {A: 100, B: 100}
  ploidy: 1
  model: hudson
  demography:
    time_units: generations
    demes:
      - name: X
        epochs: [{end_time: 1000, start_size: 2000}]
      - name: A
        ancestors: [X]
        epochs: [{start_size: 2000}]
      - name: B
        ancestors: [X]
        epochs: [{start_size: 2000}]

mutations:
  rate: 1e-8
  model: blosum62

To make this fully general we'd need to

  1. Add support for reading RateMaps from dictionaries (easy)
  2. Support parsing Ancestry and Mutation models from dictionaries (should be pretty easy, this is basically what we turn the classes into anyway). Since the ancestry models use a duration, we actually sidestep the awkward time business
  3. Think properly about time and implement start_time and end_time accordingly (but, these are pretty niche options, so could just be dropped)
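For item 1, a sketch of what reading a rate map from a dictionary might involve. The key names `position` and `rate` are chosen to match the keyword arguments of `msprime.RateMap`; the helper names and the choice to validate before constructing are assumptions for this example.

```python
def check_rate_map_dict(data):
    """Validate a {"position": [...], "rate": [...]} mapping."""
    position = list(data["position"])
    rate = list(data["rate"])
    # N intervals are delimited by N + 1 positions.
    if len(position) != len(rate) + 1:
        raise ValueError("position must have one more entry than rate")
    if position[0] != 0:
        raise ValueError("position must start at 0")
    if any(b <= a for a, b in zip(position, position[1:])):
        raise ValueError("position must be strictly increasing")
    return {"position": position, "rate": rate}


def rate_map_from_dict(data):
    """Build the rate map (assumes msprime is installed)."""
    import msprime

    return msprime.RateMap(**check_rate_map_dict(data))
```

In a config this would let `recombination_rate` be either a single number or a mapping such as `{position: [0, 50000, 100000], rate: [1e-8, 1e-7]}`.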

jeromekelleher avatar Sep 20 '21 14:09 jeromekelleher

This looks really nice to me. Agree that ancestry/mutations/output blocks makes a lot of sense, and those updates look clean. If I'm reading the changes correctly, you can place any valid argument to sim_ancestry and sim_mutations into this yaml? So specify seeds, or more complicated models (e.g. dtwf then switch to hudson), etc. For an "output" block, it might be nice to be able to specify "trees" vs "vcf", plus all the bells and whistles that go with those. Not sure how general you intend this input approach to be.
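To make the "output" block idea concrete, a hypothetical sketch: the `format` and `path` keys and the `write_output` helper are invented for illustration. The only library calls assumed are `dump` and `write_vcf` on a tskit tree sequence.

```python
def write_output(ts, output):
    """Write a tree sequence according to a hypothetical `output` block."""
    fmt = output.get("format", "trees")  # default to the native format
    path = output["path"]
    if fmt == "trees":
        ts.dump(path)
    elif fmt == "vcf":
        # write_vcf expects a text file object rather than a path.
        with open(path, "w") as vcf_file:
            ts.write_vcf(vcf_file)
    else:
        raise ValueError(f"unknown output format: {fmt!r}")
    return fmt
```

Format-specific options (the "bells and whistles") could be forwarded as extra keyword arguments, but which keys to allow is left open here.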

Overall, I think this would be a nice middle ground that avoids both Python scripting and the CLI (which can be confusing for some). Looking forward to discussing more today in a bit.

apragsdale avatar Sep 20 '21 14:09 apragsdale

I like the approach overall. I think embedding the demes bits is quite elegant.

molpopgen avatar Sep 20 '21 21:09 molpopgen