tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Add tree sequence initialization flag for "compute mutation parents"

Open daikitag opened this issue 2 years ago • 9 comments
trafficstars

When I try to construct a tree sequence through a mutation table,

  • time must either be UNKNOWN_TIME (a NAN value which indicates the time is unknown) or be a finite value which is greater or equal to the mutation node’s time, less than the node above the mutation’s time and equal to or less than the time of the parent mutation if this mutation has one. If one mutation on a site has UNKNOWN_TIME then all mutations at that site must, as a mixture of known and unknown is not valid.
  • mutation must be sorted by site ID

in the mutation requirements (https://tskit.dev/tskit/docs/stable/data-model.html#sec-mutation-requirements) are detected as errors. However, codes such as

ts = tskit.Tree.generate_comb(4, span=10).tree_sequence
tables = ts.dump_tables()
tables.sites.add_row(0, "A")
tables.mutations.add_row(site=0, node=3, derived_state="T")
tables.mutations.add_row(site=0, node=5, derived_state="T")
ts = tables.tree_sequence()

(Mutation ID 0 occurs at a later time than Mutation ID 1)

and

ts = tskit.Tree.generate_comb(4, span=10).tree_sequence
tables = ts.dump_tables()
tables.sites.add_row(0, "A")
tables.mutations.add_row(site=0, node=0, derived_state="T")
tables.mutations.add_row(site=0, node=0, derived_state="G")
ts = tables.tree_sequence()

(Mutation ID 1 should have Mutation ID 0 as its parent, though ts.mutation(1) has parent=-1 instead of parent=0)

won't be detected as errors, even though they are violating the mutation requirements of tree sequence data. I would appreciate any insights on this topic.

daikitag avatar May 22 '23 13:05 daikitag

Excellent question @daikitag! This is a limitation in the input validation in for mutations, which we are increasingly coming to agree is a bug.

The second issue example is the same issue as #2732

I'm not sure the first one is actually a bug - can you paste in the actual tree here so we can see it (print(ts.draw_text()) would be the most helpful output

jeromekelleher avatar May 22 '23 16:05 jeromekelleher

foo

The first seems like a bug in light of:

If another mutation occurs on the tree above the mutation in question, its ID must be listed as the parent.

molpopgen avatar May 22 '23 16:05 molpopgen

But is the text vague here? Are the requirements saying you should either use parent == NULL everywhere OR you have to do the "futhermore" bit?

molpopgen avatar May 22 '23 16:05 molpopgen

I got the same output as @molpopgen posted, and I think it is violating

  • when there are multiple mutations per site, mutations should be ordered by decreasing time, if known, and parent mutations must occur before their children

because mutation at node 3 in site 0 cannot occur before mutation at node 5 in site 0.

I just have a suggestion, but would it be possible for us to slightly modify the codes such that an error appears when there are multiple mutations with parent=-1 in the same edge, etc, because some algorithms use parent information in tree traversal?

daikitag avatar May 22 '23 17:05 daikitag

Hmm, this is a bit of a mess - I think we'll just have to write some code to do the actual requirements checking, and then see how much code we'd break by using it, and whether there's any noticeable perf consequences.

jeromekelleher avatar May 23 '23 08:05 jeromekelleher

It seems like we may need something to populate the mutation parent field? It will be a big hit in the API ergonomics to have to track all of that, especially when making example tree sequences "manually".

molpopgen avatar May 23 '23 12:05 molpopgen

Yes... but I guess throwing an error saying "you should run tables.compute_mutation_parents()`` would mitigate this?

jeromekelleher avatar May 23 '23 13:05 jeromekelleher

Yes... but I guess throwing an error saying "you should run tables.compute_mutation_parents()`` would mitigate this?

Aha -- I didn't know this existed!

The tree sequence initialization has a flag for "build indexes". Perhaps "compute mutation parents" could be a new one?

molpopgen avatar May 23 '23 14:05 molpopgen

Yes, good idea

jeromekelleher avatar May 23 '23 16:05 jeromekelleher

Fixed in #3212

benjeffery avatar Sep 23 '25 16:09 benjeffery