tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Memory consumption of tree sequence statistics

Open molpopgen opened this issue 5 years ago • 7 comments

When the output dimension of a statistic is large, so is the memory consumption.

The following example calculates the pairwise distance matrix for all samples from a single tree and requires a bit over 7GB of RAM for a small number of samples (1000).

import msprime
import numpy as np
import tskit


def pairwise_distance_branch(ts: tskit.TreeSequence, samples: np.array):
    sample_sets = []
    indexes = []
    for i in range(len(samples)):
        sample_sets.append([i])
        for j in range(i + 1, len(samples)):
            indexes.append((i, j))

    div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
    return div


print(msprime.__version__)
print(msprime.tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, [i for i in ts.samples()])

The versions are: 0.7.4 0.2.3

From talking to @petrelharp about this, it appears that some/most of the RAM use may be attributable to some memoization during the calculation that (he feels) may not be necessary?

molpopgen avatar May 26 '20 00:05 molpopgen

Here's the memory; the algorithm is described here. I think that summary can be recomputed each time it is needed, which would alleviate this problem, although would probably cause a drop in performance.

petrelharp avatar May 26 '20 05:05 petrelharp

Well, we could add an option to either store the intermediate results or recompute them. I don't think this would add too much complexity. That is, assuming there is a significant difference in performance. If not, then we should get rid of the stored results. Should be easy enough to do a quick test?

jeromekelleher avatar May 26 '20 08:05 jeromekelleher

Is this still an open issue? I think we should probably close it unless someone is intending to follow it up.

jeromekelleher avatar Aug 27 '20 13:08 jeromekelleher

It does need addressing by someone who knows the code. There are certain situations where one ends up with unnecessary runtime crashes. Unfortunately, I have a hard time with the stats code.

On Thu, Aug 27, 2020, 6:06 AM Jerome Kelleher [email protected] wrote:

Is this still an open issue? I think we should probably close it unless someone is intending to follow it up.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/tskit/issues/647#issuecomment-681933685, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQ6OHZYVCXU3TQRH3NKR2DSCZKK5ANCNFSM4NJ3KGJQ .

molpopgen avatar Aug 27 '20 13:08 molpopgen

OK, let's keep it open.

jeromekelleher avatar Aug 27 '20 13:08 jeromekelleher

I'm going to close this because we're addressing the general problem with pairwise statistics using a different framework now (starting from the divergence matrix in #2736)

The short version I think is that the stats API assumes that we have a relatively small number of statistics, and if we have a large number of related statistics to compute then other approaches should be used.

jeromekelleher avatar Jul 07 '23 10:07 jeromekelleher

I actually think it's still worth running those tests - what you say is true for pairwise stats, but there's also clasess of stats with output equal to the number of samples that would be nice to do this way; e.g. "relatedness matrix times a vector".

petrelharp avatar Jul 09 '23 04:07 petrelharp

Closed in #2980.

petrelharp avatar Sep 25 '24 03:09 petrelharp