msprime icon indicating copy to clipboard operation
msprime copied to clipboard

How to deal with unknown rates when running simulations

Open jeromekelleher opened this issue 3 years ago • 4 comments

#1599 introduced the idea of an "unknown" rate within a RateMap using NaNs. This solves a lot of problems, but it doesn't tell us how we should run the actual simulations, which will need some rate in the regions.

As @hyanwong points out, a sensible first pass here would be to raise an error if someone puts unknown values anywhere but the flanks (where we know how to deal with them).

@hyanwong, can you summarise the state of play otherwise please?

jeromekelleher avatar Mar 29 '21 17:03 jeromekelleher

In the flanks, it's pretty clear that we just have a rate of 0, so that we don't bother simulating any breakpoints, which will just be thrown away anyway. This is true regardless of the size of the flanks: it doesn't just have to be telemetric regions. So if we want to simulate 1Mb on one arm of chr1 in humans, we can just slice the RateMap to (say) start=2e8, end=2.01e8 the NaN values in the flanks will cause those regions to be deleted from the output tree sequence, and we won't have ancestry there, or need to put mutations in them. That's a nice way to simulate a small region while keeping the normal genetic coordinates.

If, however, there's a missing rate in the middle of a chromosome, it's much more complicated. Firstly, note that this is likely to be actually pretty rare, because even across the centromere we usually have a valid and reasonable rate, calculated by looking at the probability of switching using one allele to the right and another to the left of the long centromeric region. The problem is that whatever rate we chose here will influence the linkage between regions to the left and to the right of the missing region. (At least) 3 rates could be used:

  1. A zero rate. This is akin to cutting out the unknown sections and pasting the know ends permanently together. Its main advantage is that it would result in no extra work for the simulation. But it would risk being quite wrong if there are large unknown recombination regions in the middle of a chromosome.
  2. A maximal rate (i.e. log(2) ). This would be equivalent to treating each known section as a separate chromosome. @petrelharp points out that it might cause a lot of extra work. @jeromekelleher suggests it could be implemented by "a one-base pair segment of free recombination at the end of the unknown region". I think this might be equivalent to implementing a region of constant rate log(2)/span_of_unknown_region.
  3. The average recombination rate over the known regions (RateMap.mean_rate). The argument for this is that it fills the regions in with a sensible (justifiable) value, while not causing substantially extra work, as the mean_rate is likely to be low.

Finally, note that we probably don't want to alter the recombination map to invoice its side effect of masking out e.g. the centromeric region, because then we would lose the (accurate) recombination rate over the centromere. One possibility is to create a RateMap for gene conversion (see https://github.com/tskit-dev/msprime/issues/1212) which has an unknown rate in the centromere. This is a bit ugly, e.g. in the case of no gene conversion we would implement a gene_conversion RateMap that consists of zeros and NaNs, simply to mask out the centromere. But it has the benefit of being biologically correct: we really don't know the GC rate in the centromere, which is why it gets excluded in the output TS.

hyanwong avatar Mar 29 '21 18:03 hyanwong

One thing to note about the centromeres. I agree we have accurate information about the recombination across them which we shouldn't lose, but we don't have accurate recombination data within them. Consider this example:

┌───────────────────────────────────┐
│left  │right  │   mid│  span│  rate│
├───────────────────────────────────┤
│0     │1000   │   500│  1000│  0.10│
│1000  │1110   │  1055│   110│  0.01│
│1110  │2000   │  1555│   890│  0.20│
└───────────────────────────────────┘

Suppose the centromere spans from 1000 to 1100. We have a rate associated with the interval, but it's really only measured from the variants at either end. In the middle, we just don't know.

If we simulate with these rates, we'll get recombination breakpoints and trees within the centromere, which we don't know anything about. What we want to do is to mark the bit that we don't know about as missing data so that we don't simulate any recombination in that interval, but also so that we maintain the correct recombination rate across the centromere. I'm not quite sure how we'd do that, but that's the aim anyway.

There's two reasons we want to make sure that recombination rate within the centromere is 0 when we run the simulation:

  1. We shouldn't be simulating the recombination process, only to through stuff away later if/when we do some interval deletions
  2. If we did just simulate with the stated recombination rate across the whole centromere, and afterwards do some tree sequence operations to remove the suspect regions we'd have a messy situation where we'd need to simplify out nodes were created.

So, TLDR: I agree we should have the correct recombination rate across the centromere, but we should have zero recombination within the centromere.

jeromekelleher avatar Mar 31 '21 13:03 jeromekelleher

I'm not 100% convinced. It think it depends on whether there really is (biologically) recombination within the centromere, which I understand might be the case. If there is, then perhaps we might want to simulate it, even if (later on) the mutational process is such that we can't actually spot the recombination event.

I guess the problem here is that there will be (unknown) levels of recombination within the centromere, but probably low near the middle and higher nearer the edges. Because we don't actually know the variation in rates inside the centromere, we're stuck between modelling it as a zero rate with a sudden step at (each?) end, or a constant rate across the whole thing. Neither is correct. But one might be a better approximation than another. And actually, the same applies for hard-to-sequence regions like the tandem rDNA repeats on the tips of the acrocentrics.

Eventually I assume that T2T sequencing technology will mean that we can sequence across these repeat regions anyway, so we might want to think what the right thing to do here would be if we actually had the true sequence in those regions.

hyanwong avatar Mar 31 '21 14:03 hyanwong

If the rate in a recombination map is calculated from the genetic distance, there will be no rate=NaN regions (except perhaps the flanks), right? I don't see how missingness information in internal regions of the map would be incorporated into the recombination map file without breaking the (presumably desirable) relationship between the rates and the genetic distance. So the missingness info needs to be provided from an additional source anyhow.

What if the user gave a missingness mask to sim_ancestry()? Then masked out regions at the ends of the sequence are simulated with a recombination rate of zero, and masked regions internal to the sequence are simulated with a point mass recombination rate at one site (e.g. at one end of the region) that matches the genetic length of the region. The masked out regions would also be removed automatically by sim_ancestry() in postprocessing.

grahamgower avatar Apr 06 '21 07:04 grahamgower