tskit
tskit copied to clipboard
Add tree sequence initialization flag for "compute mutation parents"
When I try to construct a tree sequence through a mutation table,
- time must either be UNKNOWN_TIME (a NAN value which indicates the time is unknown) or be a finite value which is greater or equal to the mutation node’s time, less than the node above the mutation’s time and equal to or less than the time of the parent mutation if this mutation has one. If one mutation on a site has UNKNOWN_TIME then all mutations at that site must, as a mixture of known and unknown is not valid.
- mutation must be sorted by site ID
in the mutation requirements (https://tskit.dev/tskit/docs/stable/data-model.html#sec-mutation-requirements) are detected as errors. However, codes such as
ts = tskit.Tree.generate_comb(4, span=10).tree_sequence
tables = ts.dump_tables()
tables.sites.add_row(0, "A")
tables.mutations.add_row(site=0, node=3, derived_state="T")
tables.mutations.add_row(site=0, node=5, derived_state="T")
ts = tables.tree_sequence()
(Mutation ID 0 occurs at a later time than Mutation ID 1)
and
ts = tskit.Tree.generate_comb(4, span=10).tree_sequence
tables = ts.dump_tables()
tables.sites.add_row(0, "A")
tables.mutations.add_row(site=0, node=0, derived_state="T")
tables.mutations.add_row(site=0, node=0, derived_state="G")
ts = tables.tree_sequence()
(Mutation ID 1 should have Mutation ID 0 as its parent, though ts.mutation(1) has parent=-1 instead of parent=0)
won't be detected as errors, even though they are violating the mutation requirements of tree sequence data. I would appreciate any insights on this topic.
Excellent question @daikitag! This is a limitation in the input validation in for mutations, which we are increasingly coming to agree is a bug.
The second issue example is the same issue as #2732
I'm not sure the first one is actually a bug - can you paste in the actual tree here so we can see it (print(ts.draw_text()) would be the most helpful output
The first seems like a bug in light of:
If another mutation occurs on the tree above the mutation in question, its ID must be listed as the parent.
But is the text vague here? Are the requirements saying you should either use parent == NULL everywhere OR you have to do the "futhermore" bit?
I got the same output as @molpopgen posted, and I think it is violating
- when there are multiple mutations per site, mutations should be ordered by decreasing time, if known, and parent mutations must occur before their children
because mutation at node 3 in site 0 cannot occur before mutation at node 5 in site 0.
I just have a suggestion, but would it be possible for us to slightly modify the codes such that an error appears when there are multiple mutations with parent=-1 in the same edge, etc, because some algorithms use parent information in tree traversal?
Hmm, this is a bit of a mess - I think we'll just have to write some code to do the actual requirements checking, and then see how much code we'd break by using it, and whether there's any noticeable perf consequences.
It seems like we may need something to populate the mutation parent field? It will be a big hit in the API ergonomics to have to track all of that, especially when making example tree sequences "manually".
Yes... but I guess throwing an error saying "you should run tables.compute_mutation_parents()`` would mitigate this?
Yes... but I guess throwing an error saying "you should run tables.compute_mutation_parents()`` would mitigate this?
Aha -- I didn't know this existed!
The tree sequence initialization has a flag for "build indexes". Perhaps "compute mutation parents" could be a new one?
Yes, good idea
Fixed in #3212