tskit icon indicating copy to clipboard operation
tskit copied to clipboard

removing fixed mutations

Open petrelharp opened this issue 6 years ago • 28 comments
trafficstars

We can at times end up with a large number of fixed mutations. It would be nice to have a way to remove these. I think we discussed having simplify() do this, but it does not:

ts = msprime.simulate(10, recombination_rate=2, mutation_rate=2)
sts = ts.simplify([0,1])
sts.genotype_matrix()
### 
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 0],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]], dtype=uint8)

This would require a bit of bookkeeping (e.g. updating the ancestral state at sites that still have a segregating mutation). Not sure whether to propose this as an argument to simplify (filter_fixed_mutations) or as its own function. This is not urgent, as far as I know, but if we were to change the default behavior of simplify, it would be nice to do it soon.

petrelharp avatar Jul 10 '19 23:07 petrelharp

I remember going through this, and it turns out to be quite tricky in the general case of lots of mutations at a site to know that it's fixed. I think the semantics that we've ended up with are good, in that we provide the option to remove objects that have no references (sites, populations, etc). Fixed sites are quite a different thing.

I think we can do it with a function easily enough though? Should be fast enough in Python, since we can do in in O(#mutations) per site, using the counting logic as we have in the site general stats algorithm.

jeromekelleher avatar Jul 11 '19 07:07 jeromekelleher

Here's a python version, is this what you had in mind?

def remove_monomorphic(sts)
  tables = sts.tables
  tables.sites.clear()
  tables.mutations.clear()
  n = sts.num_samples
  for t in sts.trees(sample_counts=True):
     for s in t.sites():
        visited = False
        for m in s.mutations:
           k = t.num_samples(m.node)
           if k > 0 and k < n:
               if not visited:
                   visited = True
                   sid = tables.sites.add_row(s.position, s.ancestral_state, metadata=s.metadata)
               tables.mutations.add_row(sid, m.node, m.derived_state, parent=-1, metadata=None)
  tables.compute_mutation_parents()
  return tables.tree_sequence()

nts = remove_monomorphic(sts)
nts.genotype_matrix()

###
array([[1, 0],
       [0, 1],
       [1, 0],
       [0, 1]], dtype=uint8)

petrelharp avatar Jul 11 '19 16:07 petrelharp

Basically, yeah. But this won't handle funky situations where we have (say) n mutations to 1 over the leaves, will it? But, we should be able to build an allele count table like we do in general_site_stats and reason using that, right?

jeromekelleher avatar Jul 12 '19 06:07 jeromekelleher

True, but I don't think we want to remove those mutations.

petrelharp avatar Jul 12 '19 09:07 petrelharp

Really? The site is still monomorphic right? What's the difference?

jeromekelleher avatar Jul 12 '19 09:07 jeromekelleher

Well, in that case everyone is IBS but not IBD. They're different cases, even if you can't tell from the sequence.

petrelharp avatar Jul 12 '19 09:07 petrelharp

I see that, but it's a statement about the sequences we're making, not the ancestry I would have thought. The semantics will get weird and strained if we don't I think (remove_monomorphic_if_single_mutation).

jeromekelleher avatar Jul 12 '19 09:07 jeromekelleher

So, I don't mind if there's a few fixed mutations in there, especially if they are due to multiple mutations at the same site. The main reason I wanted this is because after a long SLiM simulatoin, a large portion of the tree sequence file can be due to fixed mutations, even after simplification (they are still there because of the initial generation). So, this is purely pragmatic, about file size. But, I don't want this to discard the still-segregating-but-invisibly multiple IBS mutations, because that is discarding information relevant to polymorphism in the population. I can imagine someone wanting the method that you're talking about, too, but I don't think it's important because a bit of bookkeeping can be used to ignore those.

petrelharp avatar Jul 15 '19 15:07 petrelharp

For instance, with this SLiM recipe (100 individuals for 100,000 generations):

initialize()
{
    setSeed(23);
    initializeTreeSeq();
    initializeMutationRate(1e-8);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 1e8-1);
    initializeRecombinationRate(1e-8);
}

1 early() {
    sim.addSubpop("p1", 100);
}

100000 {
    sim.treeSeqOutput("fixed.trees");
    catn("Done.");
    sim.simulationFinished();
}

we see that 97.6% of the mutations are fixed:

import pyslim

ts = pyslim.load("fixed.trees")
sts = ts.simplify()

freqs = ts.general_stat(np.ones(ts.num_samples).reshape((ts.num_samples, 1)), lambda x: x, windows="sites", polarised=True, span_normalise=False).reshape((ts.num_sites,))

np.mean(freqs == 200)

... and in Anastasia's simulations, this makes the metadata column longer than is allowed.

ps. we should make an "allele frequency" function, which we can do as above.

petrelharp avatar Jul 15 '19 15:07 petrelharp

Ah, I see. Should we add a filter_fixed_simple_sites (or something) option to simplify then? I think catching the single mutation fixed sites is pretty easy? I don't want to call it monomorphic or whatever as I think this is misleading in the multiple mutation case.

jeromekelleher avatar Jul 15 '19 16:07 jeromekelleher

ps. we should make an "allele frequency" function, which we can do as above.

+1

jeromekelleher avatar Jul 15 '19 16:07 jeromekelleher

Just filter_fixed_mutations should do it? (i.e., fixed mutations, not fixed alleles...)

petrelharp avatar Jul 15 '19 17:07 petrelharp

Just filter_fixed_mutations should do it? (i.e., fixed mutations, not fixed alleles...)

... Yeah, OK. But this implies that we only remove the mutation, not the site. However, if this is applied before filter_sites, we'd remove the site because any fixed mutations would have already been filtered out.

By fixed mutation, we mean there is exactly one root and there is a mutation over it. What if there are multiple mutations over the root? Do remove all of them, or do we leave them be? (Seems best to remove all of them?)

jeromekelleher avatar Jul 15 '19 18:07 jeromekelleher

Hm, well we don't necessarily remove the site, since what if there's a polymorphic mutation at it still...

Do we remove all of them, or do we leave them be? (Seems best to remove all of them?)

All of them!

petrelharp avatar Jul 15 '19 19:07 petrelharp

Hm, well we don't necessarily remove the site, since what if there's a polymorphic mutation at it still...

We'll only remove it if filter_sites is True and the number of mutations after filter_fixed_mutations has been applied is zero.

OK, this sounds like a plan then? Who wants to implement it?

jeromekelleher avatar Jul 15 '19 20:07 jeromekelleher

Over in #361 (which I suggest we merge into this one), @hyanwong suggests that we make it possible to filter not only fixed mutations (ie ones that all samples have) but more generally, mutations above nodes with no parents. We don't want to do only this, since in some applications we definately want to keep these (for instance, in slim pre-recapitation), but this could be another option. I suspect we could implement this at the same time, but haven't thought through the details.

petrelharp avatar Sep 09 '19 17:09 petrelharp

Yes, this should be doable I think, and I agree it's a good option to have.

jeromekelleher avatar Sep 11 '19 11:09 jeromekelleher

Is this still something we want @petrelharp?

jeromekelleher avatar Sep 29 '20 15:09 jeromekelleher

Yes! Sorry I haven't got to it.

petrelharp avatar Sep 29 '20 15:09 petrelharp

Just to return to this as @jeromekelleher noticed a load of fixed mutations in our workbooks after simplification, and thought it was odd. So that's another vote to implement this functionality.

hyanwong avatar Jun 16 '22 10:06 hyanwong

Huh, I thought we had an option for this in simplify. It's probably quite tricky to get right in the general case, I'm certain I thought about this at the time.

jeromekelleher avatar Jun 16 '22 10:06 jeromekelleher

You did, and it was tricky, but it's probably a good bit easier now - we've got more tools.

petrelharp avatar Jun 25 '22 04:06 petrelharp

The filter_all idea in #2681 suggests that all the filter_XXX args are tweaked at the same time. Currently filter_X only exists where X is a table. So I wonder if filter_fixed_mutations is the right name?

hyanwong avatar Jan 12 '23 11:01 hyanwong

Just returning to this, there is some thinking to do for the case of multiple roots, right? What if there are two roots, and identical mutations over each? I suspect in this case we don't remove the mutation. So this function will only work when there is a single root in the tree?

hyanwong avatar Mar 30 '23 10:03 hyanwong

That sounds good to me - the mutations aren't fixed, from the point of view of the trees; if someone wants to filter on allele frequency they can do that another way.

petrelharp avatar Mar 31 '23 14:03 petrelharp