tskit Memory consumption of tree sequence statistics

Memory consumption of tree sequence statistics

Open molpopgen opened this issue 4 years ago • 7 comments

When the output dimension of a statistic is large, so is the memory consumption.

The following example calculates the pairwise distance matrix for all samples from a single tree and requires a bit over 7GB of RAM for a small number of samples (1000).

import msprime
import numpy as np
import tskit


def pairwise_distance_branch(ts: tskit.TreeSequence, samples: np.array):
    sample_sets = []
    indexes = []
    for i in range(len(samples)):
        sample_sets.append([i])
        for j in range(i + 1, len(samples)):
            indexes.append((i, j))

    div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
    return div


print(msprime.__version__)
print(msprime.tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, [i for i in ts.samples()])

The versions are: 0.7.4 0.2.3

From talking to @petrelharp about this, it appears that some/most of the RAM use may be attributable to some memoization during the calculation that (he feels) may not be necessary?

May 26 '20 00:05 molpopgen

tskit tskit copied to clipboard

Memory consumption of tree sequence statistics

tskit
tskit copied to clipboard