Makes stats cache optional
Alternative approach for #1937
Codecov Report
Merging #2135 (aad375b) into main (3631396) will not change coverage. The diff coverage is n/a.
Coverage Diff:

| | main | #2135 | +/- |
|---|---|---|---|
| Coverage | 93.37% | 93.37% | |
| Files | 27 | 27 | |
| Lines | 25591 | 25591 | |
| Branches | 1163 | 1163 | |
| Hits | 23895 | 23895 | |
| Misses | 1666 | 1666 | |
| Partials | 30 | 30 | |
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 92.33% <ø> (ø) | |
| lwt-tests | 89.05% <ø> (ø) | |
| python-c-tests | 72.11% <ø> (ø) | |
| python-tests | 98.89% <ø> (ø) | |
Flags with carried forward coverage won't be shown.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
This is tricky. We definitely want to keep a cache of some sort, because otherwise we evaluate summary(u) many more times (not just twice, I realise: we also evaluate summary(u) twice for each branch we traverse up the tree as we add and remove edges), and summary_func is one of the major contributors to run time as it is.
What we could do is have a slightly more intelligent cache where we only store the values of summary(u) for nodes that are currently in the tree, which would reduce the storage a bit (at the cost of making things a bit more complicated).
But then, for a coalescent simulation we only have O(n + t) nodes, so we're only really going to benefit from this for relatively small n and large t (say, n = 1000 and t = 10^6). The reality is that we need O(n^3) storage to do this, because we need to cache O(n^2) space for each node currently in the tree, unless we want to evaluate the O(n^2) summary function quite a few more times. Although, I guess the actual evaluation of branch_length * summary[u] is also O(n^2), so maybe there's not much difference really.
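To make the idea concrete, here's a minimal sketch of that kind of cache, assuming a hypothetical summary_func callback and a per-node removal hook; this is not tskit's actual stats machinery, just an illustration of only holding cached values for nodes currently in the tree:

```python
class InTreeSummaryCache:
    """Cache summary(u) only while u is in the current tree (illustrative sketch)."""

    def __init__(self, summary_func):
        self.summary_func = summary_func  # hypothetical: node id -> summary value
        self.cache = {}                   # node id -> cached summary value

    def summary(self, u):
        # Evaluate summary_func(u) at most once while u stays in the tree.
        if u not in self.cache:
            self.cache[u] = self.summary_func(u)
        return self.cache[u]

    def node_left_tree(self, u):
        # Called when u's last edge is removed, so the cache only ever
        # holds values for nodes currently in the tree.
        self.cache.pop(u, None)
```

With something like this, storage is proportional to the number of nodes currently in the tree (times the size of one summary value) rather than to the total number of nodes in the tables.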
Any thoughts @petrelharp?
Here are some thoughts:
- The caching is only saving us a constant multiple on the number of summary_func evaluations, since, as you say, it's updating it twice for each added/removed edge; but we're going to have to evaluate summary_func again for all such updated nodes anyway. However, this constant might be substantial.
- So, we could avoid caching entirely if we were instead clever about which nodes needed to be updated, and in which order. For instance, if we kept track of which nodes needed their contributions updated and then evaluated summary_func for those just once after edges had been added and removed, then we'd have nearly the minimal number of required summary_func calls (we'd have one extra call per node: for instance, we'd call summary_func(u) when u is added and also when it is removed). There's a rough sketch of this bookkeeping after this list.
- I don't know how calling summary_func compares to looking up the pre-computed value? The latter presumably costs something, especially when the table of such things gets big?
- An alternative might be something along the lines of this algorithm: I think it'd be possible to compute the n^2 covariance matrix by caching at each node a vector of length n (something like the covariance of that node with all samples over some set of SNPs) and then propagating these length-n vectors downwards (as described in that link) until they reach the samples; the length-n vector at each sample would then be a row of the matrix (see the second sketch below). I'm unclear right now how this would compare to what we're talking about, or if it would generalize beyond covariance.
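Here's a rough sketch of the "mark dirty, evaluate once" bookkeeping from the second bullet, assuming hypothetical summary_func and branch_length callbacks and a scalar summary for simplicity; this is not the actual tskit update loop, just an illustration of deferring summary_func calls until after a batch of edge insertions/removals:

```python
class DeferredSummaryUpdater:
    """Illustrative sketch: edge updates only mark nodes dirty; summary_func
    is called once per dirty node when the batch is flushed."""

    def __init__(self, summary_func):
        self.summary_func = summary_func  # hypothetical: node id -> scalar
        self.contribution = {}            # node id -> last contribution added to total
        self.dirty = set()                # nodes touched by edge insertions/removals
        self.total = 0.0                  # running value of the statistic

    def mark_dirty(self, u):
        # Called for each node on the path up the tree when an edge is
        # added or removed; summary_func is NOT called here.
        self.dirty.add(u)

    def flush(self, branch_length):
        # After all edge updates for this tree transition, call summary_func
        # exactly once per dirty node and refresh its contribution.
        for u in self.dirty:
            self.total -= self.contribution.get(u, 0.0)
            new = branch_length(u) * self.summary_func(u)
            self.contribution[u] = new
            self.total += new
        self.dirty.clear()
```

And here's a rough sketch of the downward-propagation idea from the last bullet, for the concrete case of a shared-branch-length (covariance-like) matrix on a single tree, using the public tskit Tree API rather than the incremental stats engine: each node inherits its parent's length-n vector plus its own branch length spread over the samples beneath it, and the vector that arrives at sample i becomes row i of the matrix.

```python
import numpy as np
import tskit


def shared_branch_length_matrix(tree: tskit.Tree) -> np.ndarray:
    # Sketch only: O(n) work per node, single tree, branch-mode analogue of
    # the covariance described above.
    samples = list(tree.tree_sequence.samples())
    index = {s: i for i, s in enumerate(samples)}
    n = len(samples)
    vec = {}                  # node id -> length-n vector propagated downwards
    out = np.zeros((n, n))
    for u in tree.nodes(order="preorder"):
        parent = tree.parent(u)
        v = np.zeros(n) if parent == tskit.NULL else vec[parent].copy()
        below = [index[s] for s in tree.samples(u)]
        v[below] += tree.branch_length(u)   # this node's contribution
        vec[u] = v
        if tree.is_sample(u):
            out[index[u]] = v               # row for sample u
    return out
```

I'm not claiming either sketch is how it should be wired into the C stats engine; they're just to pin down what the bookkeeping would look like.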