
CpG mutation models

Open hyanwong opened this issue 7 months ago • 4 comments

Elevated mutation rates at CpG dinucleotides are one of the major contributors to mutation rate variation in mammals. They can't easily be simulated with sim_mutations, but there are probably reasonable approximations if a reference sequence is defined.

@petrelharp says, on Slack:

It'd take some thinking to figure out how to do it - since we can't do real context-dependence, the most obvious thing is a dinucleotide model, but then this misses half the CpG pairs. I think a pretty good approximation to real context-dependence could be done by just using the reference sequence when you don't know what the neighboring nucleotide is, but getting that into msprime would be a significant project.
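As a rough illustration of what "using the reference sequence" could look like without any changes to msprime, here is a minimal sketch that marks CpG sites in a reference string and builds a piecewise-constant mutation rate over positions. This is a cruder approximation than the context-dependent idea above: it only raises the rate at reference CpG positions, it does not restrict mutations to C>T transitions, and sites stay elevated even after the CpG has mutated away. The function name `cpg_rate_intervals` and the parameters `base_rate` and `cpg_multiplier` are hypothetical, chosen for this example.

```python
def cpg_rate_intervals(reference, base_rate, cpg_multiplier):
    """Build a piecewise-constant rate over a reference sequence.

    Hypothetical helper (not part of msprime). Returns (positions, rates)
    where positions has one more entry than rates, and rates[k] applies to
    the half-open interval [positions[k], positions[k + 1]).
    """
    seq = reference.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("reference sequence must be non-empty")

    # Mark both bases of every CpG dinucleotide found in the reference.
    is_cpg = [False] * n
    for i in range(n - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            is_cpg[i] = True
            is_cpg[i + 1] = True

    # Merge runs of sites with the same rate into intervals.
    positions = [0]
    rates = []
    for i in range(n):
        rate = base_rate * cpg_multiplier if is_cpg[i] else base_rate
        if rates and rates[-1] == rate:
            continue  # still inside the current interval
        if rates:
            positions.append(i)  # a new interval starts at site i
        rates.append(rate)
    positions.append(n)
    return positions, rates
```

The two lists are shaped so that they could be passed to `msprime.RateMap(position=positions, rate=rates)`, and the resulting rate map given as the `rate` argument of `msprime.sim_mutations`, assuming the tree sequence has the same sequence length as the reference. The multiplier itself is a free parameter here and would need to be chosen from empirical estimates.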

The main mention of CpG on the msprime GitHub repo is in https://github.com/tskit-dev/msprime/issues/972, and only in passing, so I have opened this issue for people who search for the term in conjunction with msprime and who might want to think about an implementation.

hyanwong, Nov 14 '23 21:11