tskit icon indicating copy to clipboard operation
tskit copied to clipboard

tree.tmrca errors when given two nodes having no mrca

Open petrelharp opened this issue 4 years ago • 4 comments

This script:

import tskit
t = tskit.TableCollection(sequence_length=1)
t.nodes.add_row(flags=1, time=0)
t.nodes.add_row(flags=1, time=0)

ts = t.tree_sequence()

tr = ts.first()
print("MRCA", tr.mrca(0, 1))
print("TMRCA", tr.tmrca(0, 1))

produces

MRCA -1
Traceback (most recent call last):
  File "/home/peter/debug/tskit/no_root.py", line 10, in <module>
    print("TMRCA", tr.tmrca(0, 1))
  File "/home/peter/.local/lib/python3.9/site-packages/tskit-0.4.0-py3.9-linux-x86_64.egg/tskit/trees.py", line 1056, in tmrca
    return self.get_time(self.get_mrca(u, v))
  File "/home/peter/.local/lib/python3.9/site-packages/tskit-0.4.0-py3.9-linux-x86_64.egg/tskit/trees.py", line 1317, in get_time
    return self.time(u)
  File "/home/peter/.local/lib/python3.9/site-packages/tskit-0.4.0-py3.9-linux-x86_64.egg/tskit/trees.py", line 1330, in time
    return self._ll_tree.get_time(u)
ValueError: Node index out of bounds

... for obvious reasons. I haven't tracked down if this is new behavior or not? A better return value might be Inf, but I wouldn't want to break backwards compatibility. At least a better error message might be produced.

petrelharp avatar Dec 15 '21 00:12 petrelharp

Definitely seems like None or Inf would be better here. I'm guessing it is a -1 being passed off as an array index for Tree somewhere.

molpopgen avatar Dec 15 '21 01:12 molpopgen

I think returning inf is semantically fine and is inline with the time of the virtual root being +inf. We've also defined the branch length of a root as zero, so we have made these corner case exceptions before. Is there a specific calculation you have in mind that these semantics would make simpler @petrelharp? If so, I'm all for the change.

Technically it's a breaking change, but I doubt anyone is depending on the ValueError semantics here.

jeromekelleher avatar Dec 15 '21 08:12 jeromekelleher

Is there a specific calculation you have in mind that these semantics would make simpler @petrelharp?

No - I was just iterating over a bunch of TMRCAs and got a "unknown node" error, which was confusing because neither of the nodes I asked about were missing. If we really wanted to be correct maybe we'd return None when one or both nodes are missing data, and Inf otherwise, but Inf is fine.

petrelharp avatar Dec 15 '21 17:12 petrelharp

It'd be more troublesome to start messing around with missing data I think, as we'd have to start defining different behaviour on sample and non-sample nodes. Just having a time of inf when the mrca is NULL is much simpler.

jeromekelleher avatar Dec 15 '21 17:12 jeromekelleher