tskit
tskit copied to clipboard
Mutation's "inherited" state
Getting the state that a mutation inherited is fiddly right now, and has been annoying me quite bit recently. We should add a method to the Python object at least to get it, but it would also be nice to have access to it in numpy for reasoning efficiently about reversions and so on.
Does inherited_state feel like the right term here?
The numpy stuff would depend on us making some decisions about vectorised access to the string columns which I guess I should open a seperate issue on.
tables = ts.tables
assert np.all(tables.mutations.derived_state_offset == np.arange(ts.num_mutations + 1))
derived_state = tables.mutations.derived_state.view("S1")
assert np.all(tables.sites.ancestral_state_offset == np.arange(ts.num_sites + 1))
ancestral_state = tables.sites.ancestral_state.view("S1")
# Compute the "inherited state" for each mutation
inherited_state = ancestral_state[ts.mutations_site]
mutations_with_parent = ts.mutations_parent != -1
parent = ts.mutations_parent[mutations_with_parent]
assert np.all(parent >= 0)
inherited_state[mutations_with_parent] = derived_state[parent]
# Check
for mut in ts.mutations():
state0 = ts.site(mut.site).ancestral_state
if mut.parent != -1:
state0 = ts.mutation(mut.parent).derived_state
assert state0 == inherited_state[mut.id].decode()
inherited_state sounds good to me, only other thing I could come up with parent_state but I think that could get confused with the parent node state.
I have wanted this a number of times too, so +1 from me. The name seems reasonable.
Presumably we would need to make sure that compute_mutation_parents() has been run before calculating this.
Yes, it's a bug really that we don't fully validate the mutation parents on tree sequence creation.
Closing for inactivity and labelling "future", please re-open if you plan to work on this.
I definitely want to implement this once the mutation parents bug has been fixed, so I'm assigning it to me, and reopening. I'll at least implement the non-array version.
Note this is straightforwardly added once #3228 as a numpy array mutations_inherited_state, computed like this:
def compute_inherited_state(ts):
derived_state = ts.mutations_derived_state
ancestral_state = ts.sites_ancestral_state
inherited_state = ancestral_state[ts.mutations_site]
mutations_with_parent = ts.mutations_parent != -1
parent = ts.mutations_parent[mutations_with_parent]
inherited_state[mutations_with_parent] = derived_state[parent]
return inherited_state
as this is a derived array, I guess it's debateable whether we should cache it. I think probably not, so we just do something like
@property
def mutations_inherited_state(self):
return compute_inherited_state(self)
I assume I'm not thinking about this properly, butt I'm not totally sure I follow how that works for a bi-allelic site? If there is only 1 mutation at a site, isn't the parent -1?
Yep - and their inherited state is the site's ancestral state. This is really for complicated stuff like SARS-CoV-2 where we get up to 30K mutations per site.
Oh -- I get it, it is the state inherited by the node where a mutation first lands.
Now that https://github.com/tskit-dev/tskit/pull/3212#event-19538220537 and https://github.com/tskit-dev/tskit/pull/3228 are merged, we can finally add this, I think. It would be really useful.
I think we should probably have ts.mutation(N).inherited_state() too, should we?
Yes - but let's get some other projects shipped first