tskit
tskit copied to clipboard
Memory consumption of tree sequence statistics
When the output dimension of a statistic is large, so is the memory consumption.
The following example calculates the pairwise distance matrix for all samples from a single tree and requires a bit over 7GB of RAM for a small number of samples (1000).
import msprime
import numpy as np
import tskit
def pairwise_distance_branch(ts: tskit.TreeSequence, samples: np.array):
sample_sets = []
indexes = []
for i in range(len(samples)):
sample_sets.append([i])
for j in range(i + 1, len(samples)):
indexes.append((i, j))
div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
return div
print(msprime.__version__)
print(msprime.tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, [i for i in ts.samples()])
The versions are: 0.7.4 0.2.3
From talking to @petrelharp about this, it appears that some/most of the RAM use may be attributable to some memoization during the calculation that (he feels) may not be necessary?
Here's the memory; the algorithm is described here. I think that summary can be recomputed each time it is needed, which would alleviate this problem, although would probably cause a drop in performance.
Well, we could add an option to either store the intermediate results or recompute them. I don't think this would add too much complexity. That is, assuming there is a significant difference in performance. If not, then we should get rid of the stored results. Should be easy enough to do a quick test?
Is this still an open issue? I think we should probably close it unless someone is intending to follow it up.
It does need addressing by someone who knows the code. There are certain situations where one ends up with unnecessary runtime crashes. Unfortunately, I have a hard time with the stats code.
On Thu, Aug 27, 2020, 6:06 AM Jerome Kelleher [email protected] wrote:
Is this still an open issue? I think we should probably close it unless someone is intending to follow it up.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/tskit/issues/647#issuecomment-681933685, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQ6OHZYVCXU3TQRH3NKR2DSCZKK5ANCNFSM4NJ3KGJQ .
OK, let's keep it open.
I'm going to close this because we're addressing the general problem with pairwise statistics using a different framework now (starting from the divergence matrix in #2736)
The short version I think is that the stats API assumes that we have a relatively small number of statistics, and if we have a large number of related statistics to compute then other approaches should be used.
I actually think it's still worth running those tests - what you say is true for pairwise stats, but there's also clasess of stats with output equal to the number of samples that would be nice to do this way; e.g. "relatedness matrix times a vector".
Closed in #2980.