tskit icon indicating copy to clipboard operation
tskit copied to clipboard

How to skip empty regions at the start and end of a tree seq

Open hyanwong opened this issue 3 years ago • 6 comments

We are about to make a change to tsinfer so that the regions before the last site and after the last site in a tree sequence have no topology. I imagine that this could be quite confusing for people looping through the trees, as sometimes (if the first site is not at position 0), the first tree will have num_roots == num_samples, and will not have any topology.

I suspect it would be useful to be able to skip these empty regions at the start and end of a tree sequence when using the trees() iterator. Have we any suggestions for a nice interface to do this? At the moment I'm doing something like this:

for tree in output_ts.trees():
    if tree.num_edges == 0 and (tree.index == 0 or tree.index == ts.num_trees - 1):
        continue
    ...

But I think that might be a bit advanced for normal users to have to do regularly?

hyanwong avatar Oct 09 '22 19:10 hyanwong

This also makes me think that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1

hyanwong avatar Oct 09 '22 19:10 hyanwong

Good idea, how about

for tree in ts.trees(skip_missing=True):
    ...

we define a tree as missing if tree.num_roots == ts.num_samples. This would also take out centromeres etc, not just the teleomeres (if they've been chopped out of the ts).

jeromekelleher avatar Oct 10 '22 06:10 jeromekelleher

This also makes me thing that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1

This would clash with seeking methods, but I don't think we need them anyway if we use skip_missing as suggested here.

jeromekelleher avatar Oct 10 '22 06:10 jeromekelleher

This also makes me thing that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1

This would clash with seeking methods, but I don't think we need them anyway if we use skip_missing as suggested here.

Sure. I was simply thinking of an attribute that returned True if the tree was the last tree, as a shorthand (I don't think this would clash with the seeking methods, but it's only a syntactic nicety)

@property
def last(self)
    return self.index == self.tree_sequence.num_trees - 1

hyanwong avatar Oct 10 '22 07:10 hyanwong

I meant it would clash with the method called first

Anyway, like I say, I don't think we need it.

jeromekelleher avatar Oct 10 '22 09:10 jeromekelleher

I meant it would clash with the method called first

Ah, ISWYM - I forgot that existed.

hyanwong avatar Oct 10 '22 09:10 hyanwong

we define a tree as missing if tree.num_roots == ts.num_samples

Actually, I think this particular definition is a bad idea, because of the root_threshold setting. If that's set to 2, the test will fail. Perhaps it's better to look for tree.num_edges == 0? That won't skip trees with entirely dead branches, but I don't think we would want to do that anyway.

Given that it's not obvious, I suspect a convenience function, tree.is_empty would be helpful. then we can document what we mean by an empty tree, and use a standardised definition everywhere.

hyanwong avatar Nov 17 '22 23:11 hyanwong

tree.num_edges = 0 is a good definition, and an is_empty function would be useful.

jeromekelleher avatar Nov 18 '22 09:11 jeromekelleher

We decided at a group meeting that in the absence of current user demand, we can suggest the following in documentation, which is possibly clearer than a specific parameter anyway

for tree in ts.trees():
    if tree.num_edges == 0:
        continue
    # Do stuff

hyanwong avatar Jan 10 '23 13:01 hyanwong

Note that num_edges is preferable here as it is pre-calculated, total_branch_length involves a traversal.

benjeffery avatar Jan 10 '23 14:01 benjeffery