tskit
Use divergence_matrix for downstream statistics
I think we can rephrase at least genetic_relatedness (aka eGRM) in terms of divergence_matrix, which should substantially improve performance (although this depends on #2779, which is needed for decent site-mode performance).
Can we transform the divergence matrix into genetic_relatedness efficiently in Python (i.e. using numpy) or do we need C code for this @petrelharp?
Are there other stats we can do this for?
We'd need to consider the compatibility issues this raises, of course. For one, we'll be computing something slightly different in site mode after this, I guess?
Let's see - we talked through how to do this somewhere; the missing piece is you need the function that computes, for each node, the total area from the node to the root (that's in branch mode; for site it's the number of mutations). Call this derived; then relatedness[i,j] = derived[i] + derived[j] - divergence[i,j].
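To make the identity above concrete, here is a minimal sketch on a hand-built toy tree. It does not use tskit at all: the tree, the dictionaries PARENT and BRANCH_LENGTH, and all function names are hypothetical and exist only for illustration. It assumes branch mode, with divergence taken as the total branch length on the path between two samples. Under that assumption, derived[i] + derived[j] - divergence[i,j] comes out as twice the branch length shared by the two root paths; any centering or scaling applied by the real statistic is not modelled here.

```python
# Toy tree (hypothetical, not built with tskit): samples 0, 1, 2; root is node 4.
#        4
#      /   \
#     3     \
#    / \     \
#   0   1     2
PARENT = {0: 3, 1: 3, 2: 4, 3: 4, 4: None}
# Length of the branch above each non-root node.
BRANCH_LENGTH = {0: 1.0, 1: 1.0, 2: 3.0, 3: 2.0}
SAMPLES = [0, 1, 2]


def branches_above(node):
    """Nodes whose parent branch lies on the path from `node` to the root."""
    branches = set()
    while PARENT[node] is not None:
        branches.add(node)
        node = PARENT[node]
    return branches


def area(branches):
    """Total length of a set of branches (each named by its child node)."""
    return sum(BRANCH_LENGTH[b] for b in branches)


def derived_vector():
    """derived[i]: total branch length from sample i up to the root."""
    return [area(branches_above(u)) for u in SAMPLES]


def divergence_matrix():
    """divergence[i][j]: branch length on the path between samples i and j."""
    return [
        [area(branches_above(u) ^ branches_above(v)) for v in SAMPLES]
        for u in SAMPLES
    ]


def shared_area_matrix():
    """Branch length common to both root paths (root down to the MRCA)."""
    return [
        [area(branches_above(u) & branches_above(v)) for v in SAMPLES]
        for u in SAMPLES
    ]


def relatedness_from_divergence(derived, divergence):
    """Apply relatedness[i][j] = derived[i] + derived[j] - divergence[i][j]."""
    n = len(derived)
    return [
        [derived[i] + derived[j] - divergence[i][j] for j in range(n)]
        for i in range(n)
    ]


derived = derived_vector()
divergence = divergence_matrix()
relatedness = relatedness_from_divergence(derived, divergence)
shared = shared_area_matrix()
```

If derived and divergence were numpy arrays, the same transformation for a single window would be the broadcast expression derived[:, None] + derived[None, :] - divergence; computing derived itself is the part that still needs a pass over the trees.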
HOWEVER, your point about back mutations is an important one. I think that we argued that if divergence matrix and divergence gave slightly different answers that was OK; if that is true then relatedness_matrix and relatedness could also give slightly different answers?
Ah yes, that makes sense. Given we need to compute derived per window, it's probably simpler to do in C rather than try to come up with numpy tricks.
So, we create a C function genetic_relatedness_matrix, following the pattern of divergence_matrix, and expose this to Python in the standard way?
I think having the *_matrix functions have slightly different semantics is fine; we just need to document it clearly.
This was done in #2823; see #1623 for documentation.