How to skip empty regions at the start and end of a tree seq
We are about to make a change to tsinfer so that the regions before the last site and after the last site in a tree sequence have no topology. I imagine that this could be quite confusing for people looping through the trees, as sometimes (if the first site is not at position 0), the first tree will have num_roots == num_samples, and will not have any topology.
I suspect it would be useful to be able to skip these empty regions at the start and end of a tree sequence when using the trees() iterator. Have we any suggestions for a nice interface to do this? At the moment I'm doing something like this:
for tree in output_ts.trees():
if tree.num_edges == 0 and (tree.index == 0 or tree.index == ts.num_trees - 1):
continue
...
But I think that might be a bit advanced for normal users to have to do regularly?
This also makes me think that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1
Good idea, how about
for tree in ts.trees(skip_missing=True):
...
we define a tree as missing if tree.num_roots == ts.num_samples. This would also take out centromeres etc, not just the teleomeres (if they've been chopped out of the ts).
This also makes me thing that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1
This would clash with seeking methods, but I don't think we need them anyway if we use skip_missing as suggested here.
This also makes me thing that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1
This would clash with seeking methods, but I don't think we need them anyway if we use
skip_missingas suggested here.
Sure. I was simply thinking of an attribute that returned True if the tree was the last tree, as a shorthand (I don't think this would clash with the seeking methods, but it's only a syntactic nicety)
@property
def last(self)
return self.index == self.tree_sequence.num_trees - 1
we define a tree as missing if tree.num_roots == ts.num_samples
Actually, I think this particular definition is a bad idea, because of the root_threshold setting. If that's set to 2, the test will fail. Perhaps it's better to look for tree.num_edges == 0? That won't skip trees with entirely dead branches, but I don't think we would want to do that anyway.
Given that it's not obvious, I suspect a convenience function, tree.is_empty would be helpful. then we can document what we mean by an empty tree, and use a standardised definition everywhere.
tree.num_edges = 0 is a good definition, and an is_empty function would be useful.
We decided at a group meeting that in the absence of current user demand, we can suggest the following in documentation, which is possibly clearer than a specific parameter anyway
for tree in ts.trees():
if tree.num_edges == 0:
continue
# Do stuff
Note that num_edges is preferable here as it is pre-calculated, total_branch_length involves a traversal.