Makes stats cache optional
Alternative approach for #1937
Codecov Report
Merging #2135 (aad375b) into main (3631396) will not change coverage. The diff coverage is n/a.
Coverage Diff:

| | main | #2135 | +/- |
|---|---|---|---|
| Coverage | 93.37% | 93.37% | |
| Files | 27 | 27 | |
| Lines | 25591 | 25591 | |
| Branches | 1163 | 1163 | |
| Hits | 23895 | 23895 | |
| Misses | 1666 | 1666 | |
| Partials | 30 | 30 | |
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 92.33% <ø> (ø) | |
| lwt-tests | 89.05% <ø> (ø) | |
| python-c-tests | 72.11% <ø> (ø) | |
| python-tests | 98.89% <ø> (ø) | |
Flags with carried forward coverage won't be shown.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
This is tricky. We definitely want to keep a cache of some sort, because otherwise we evaluate summary(u) many more times (not just twice, I realise: we also evaluate summary(u) twice for each branch we traverse up the tree as we add and remove edges), and summary_func is one of the major contributors to run time as it is.
What we could do is have a slightly more intelligent cache where we only store the values of summary(u) for nodes that are currently in the tree, which would reduce the storage a bit (at the cost of making things a bit more complicated).
But then, for a coalescent simulation we only have O(n + t) nodes, so we're only really going to benefit from this for relatively small n and large t (say, n = 1000 and t = 10^6). The reality is that we need O(n^3) storage to do this, because we need to cache O(n^2) space for each node currently in the tree, unless we want to evaluate the O(n^2) summary function quite a few more times. Although, I guess the actual evaluation of branch_length * summary[u] is also O(n^2), so maybe there's not much difference really.
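To make the idea concrete, here's a minimal sketch of that kind of cache, assuming a hypothetical summary_func callback and a per-node removal hook; this is not tskit's actual stats machinery, just an illustration of only holding cached values for nodes currently in the tree:

```python
class InTreeSummaryCache:
    """Cache summary(u) only while u is in the current tree (illustrative sketch)."""

    def __init__(self, summary_func):
        self.summary_func = summary_func  # hypothetical: node id -> summary value
        self.cache = {}                   # node id -> cached summary value

    def summary(self, u):
        # Evaluate summary_func(u) at most once while u stays in the tree.
        if u not in self.cache:
            self.cache[u] = self.summary_func(u)
        return self.cache[u]

    def node_left_tree(self, u):
        # Called when u's last edge is removed, so the cache only ever
        # holds values for nodes currently in the tree.
        self.cache.pop(u, None)
```

With something like this, storage is proportional to the number of nodes currently in the tree (times the size of one summary value) rather than to the total number of nodes in the tables.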
Any thoughts @petrelharp?
Here are some thoughts:
- The caching is only saving us a constant multiple on the number of summary_func evaluations, since, as you say, it's updating it twice for each added/removed edge; but we're going to have to evaluate summary_func again for all such updated nodes anyway. However, this constant might be substantial.
- So, we could avoid caching entirely if we were instead clever about which nodes needed to be updated, and in which order. For instance, if we kept track of which nodes needed their contributions updated and then evaluated summary_func for those just once after edges had been added and removed, then we'd have nearly the minimal number of required summary_func calls (we'd have one extra call per node: for instance, we'd call summary_func(u) when u is added and also when it is removed). There's a rough sketch of this bookkeeping after this list.
- I don't know how calling summary_func compares to looking up the pre-computed value? The latter presumably costs something, especially when the table of such things gets big?
- An alternative might be something along the lines of this algorithm: I think it'd be possible to compute the n^2 covariance matrix by caching at each node a vector of length n (something like the covariance of that node with all samples over some set of SNPs) and then propagating these length-n vectors downwards (as described in that link) until they reach the samples; the length-n vector at each sample would then be a row of the matrix (see the second sketch below). I'm unclear right now how this would compare to what we're talking about, or if it would generalize beyond covariance.
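Here's a rough sketch of the "mark dirty, evaluate once" bookkeeping from the second bullet, assuming hypothetical summary_func and branch_length callbacks and a scalar summary for simplicity; this is not the actual tskit update loop, just an illustration of deferring summary_func calls until after a batch of edge insertions/removals:

```python
class DeferredSummaryUpdater:
    """Illustrative sketch: edge updates only mark nodes dirty; summary_func
    is called once per dirty node when the batch is flushed."""

    def __init__(self, summary_func):
        self.summary_func = summary_func  # hypothetical: node id -> scalar
        self.contribution = {}            # node id -> last contribution added to total
        self.dirty = set()                # nodes touched by edge insertions/removals
        self.total = 0.0                  # running value of the statistic

    def mark_dirty(self, u):
        # Called for each node on the path up the tree when an edge is
        # added or removed; summary_func is NOT called here.
        self.dirty.add(u)

    def flush(self, branch_length):
        # After all edge updates for this tree transition, call summary_func
        # exactly once per dirty node and refresh its contribution.
        for u in self.dirty:
            self.total -= self.contribution.get(u, 0.0)
            new = branch_length(u) * self.summary_func(u)
            self.contribution[u] = new
            self.total += new
        self.dirty.clear()
```

And here's a rough sketch of the downward-propagation idea from the last bullet, for the concrete case of a shared-branch-length (covariance-like) matrix on a single tree, using the public tskit Tree API rather than the incremental stats engine: each node inherits its parent's length-n vector plus its own branch length spread over the samples beneath it, and the vector that arrives at sample i becomes row i of the matrix.

```python
import numpy as np
import tskit


def shared_branch_length_matrix(tree: tskit.Tree) -> np.ndarray:
    # Sketch only: O(n) work per node, single tree, branch-mode analogue of
    # the covariance described above.
    samples = list(tree.tree_sequence.samples())
    index = {s: i for i, s in enumerate(samples)}
    n = len(samples)
    vec = {}                  # node id -> length-n vector propagated downwards
    out = np.zeros((n, n))
    for u in tree.nodes(order="preorder"):
        parent = tree.parent(u)
        v = np.zeros(n) if parent == tskit.NULL else vec[parent].copy()
        below = [index[s] for s in tree.samples(u)]
        v[below] += tree.branch_length(u)   # this node's contribution
        vec[u] = v
        if tree.is_sample(u):
            out[index[u]] = v               # row for sample u
    return out
```

I'm not claiming either sketch is how it should be wired into the C stats engine; they're just to pin down what the bookkeeping would look like.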