tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Require mutation parents for TreeSequence instantiation

Open jeromekelleher opened this issue 2 years ago • 8 comments

It has popped up in a few places that our laxity in not checking the correctness of mutation parents at TreeSequence initialisation was a mistake, and we really should be validating this part of the data model. I think the main reason we didn't enforce it at the time was a fear of breaking existing code (but I could well be wrong)

refs

  • #2730
  • #2729
  • https://github.com/tskit-dev/tskit/issues/2631#issuecomment-1485835240

jeromekelleher avatar Apr 03 '23 15:04 jeromekelleher

I've run across at least one other possible reason we didn't do it: currently sort doesn't put mutations in mutation-parent order:

Mutations are sorted by site ID, and within the same site are sorted by time. Those with equal or unknown time retain their relative ordering. This does not currently rearrange tables so that mutations occur after their mutation parents, which is a requirement for valid tree sequences.

(I think maybe 'cause it's hard? This is ringing a bell...)

petrelharp avatar Apr 03 '23 22:04 petrelharp

Riiight. Maybe the topological sort algorithm that @benjeffery developed for pedigrees could be used for this?

jeromekelleher avatar Apr 04 '23 10:04 jeromekelleher

Other past discussion:

  • #651
  • #1097
  • #1197
  • #1227

I haven't yet run across any fundamental reasons, other than "it's a bit icky".

petrelharp avatar Apr 04 '23 15:04 petrelharp

Starting to look at this, here's my understanding of the issue.

For mutations we currently have checks for:

  • site, node, parents basic bounds checking
  • can't parent self
  • if time is known, it is finite and older or equal to node time
  • times at a site must all be known or unknown
  • parent must be same site
  • time younger or equal to parent
  • time must be younger or equal to parent node time
  • mutations ordered by site
  • parents sorted before children
  • parent time older than child

What we don't check for (and this issue is suggesting) is that if a mutation has a parent by topology that that parent is recorded in the mutation.

By "parent by topology" I mean if in tracing a path from a mutation to a root we pass through another mutation at that site, then that mutation must be marked as the parent of the first.

The most common cause of this being that no parents were set at all for any mutations. This is often completely valid as mutations might be on independent branches, or not trace to a common root.

It looks like this would be easiest to do in tsk_table_collection_check_tree_integrity where we already have the node topology constructed, a sketch would be:

for each site
    build a node->mutation mapping
    for each mutation at the site
        focal_node = mutation.node
        while focal_node != NULL
            check the mapping for a mutation at focal_node - if found this is the topological parent - break
            focal_node = parent[focal_node]

You can optimise by continuing the traverse and later skipping the mutation you just found, and by starting at the end of the mutation list which is the children due to the time sorting requirement, and not checking the final mutation you find. There's some details I'm glossing over with finding mutations at the same node as your focal mutation.

@jeromekelleher Does this sound in the right ballpark? It sounds quite an expensive check, unless there is a quick way I'm missing.

benjeffery avatar Jun 10 '25 01:06 benjeffery

I think the first iteration of this would be to just call compute_mutation_parents (or the underlying algorithm used) after the existing checks have passed, and then check that the results of mutations.parent are the same. We could then see what the actual computational cost of doing this is, and see if it's worth doing something more focused afterwards.

jeromekelleher avatar Jun 10 '25 08:06 jeromekelleher

Here's a quick check in Python, plotting the ratio of tskit.load to tables.compute_mutation_parents. It's not an insignificant fraction of the load time. I assume the method above would be faster - but not more than 10x, probably less.

The perf is good enough though that I think the strategy here is to get this "slow but correct" version in and if we're happy (and have time!) do something better as that later change would be non-breaking.

Image

benjeffery avatar Jun 10 '25 09:06 benjeffery

Probably adding the lowest-level compute_mutation_times bits in C would reduce that - I'd imagine tables.compute_mutation_times is already doing some of the checking

jeromekelleher avatar Jun 10 '25 09:06 jeromekelleher

The graph above is misleading as tsk_table_collection_compute_mutation_parents does a full tsk_table_collection_check_integrity call. This is good news as it will be much faster. Part way through pulling that out so the internal algorithm can be called without the check.

benjeffery avatar Jun 10 '25 13:06 benjeffery

Fixed in #3212

benjeffery avatar Sep 23 '25 16:09 benjeffery