Memory consumption of tree sequence statistics
When the output dimension of a statistic is large, so is the memory consumption.
The following example calculates the pairwise branch-mode divergence between all samples of a single tree. With only 1000 samples (499,500 pairs), it needs a bit over 7GB of RAM.
import msprime
import numpy as np
import tskit


def pairwise_distance_branch(ts: tskit.TreeSequence, samples: np.ndarray):
    # One single-sample set per sample, plus every unordered pair (i, j) with i < j.
    sample_sets = []
    indexes = []
    for i in range(len(samples)):
        # Use the sample's node ID, not its position in the list.
        sample_sets.append([samples[i]])
        for j in range(i + 1, len(samples)):
            indexes.append((i, j))
    # Branch-mode divergence: one output value per entry in indexes.
    div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
    return div


print(msprime.__version__)
print(tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, ts.samples())
The versions printed are msprime 0.7.4 and tskit 0.2.3.
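For scale, here is a back-of-envelope check of how small the output itself is compared with the observed usage. The second half is only a guess: it assumes, hypothetically, that one float64 row of output length is held per tree node, which I have not confirmed against the tskit source.

```python
# Size of the result for 1000 samples, one float64 per unordered pair.
n_samples = 1000
n_pairs = n_samples * (n_samples - 1) // 2
output_bytes = n_pairs * 8

# A single binary tree over n samples has 2n - 1 nodes.
n_nodes = 2 * n_samples - 1

# ASSUMPTION (not verified in tskit): one output-length row kept per node.
hypothetical_bytes = n_nodes * output_bytes

print(f"pairs: {n_pairs}")
print(f"output array: {output_bytes / 2**20:.1f} MiB")
print(f"hypothetical per-node buffers: {hypothetical_bytes / 2**30:.1f} GiB")
```

The result array is only a few MiB, so nearly all of the 7GB must sit in intermediate state rather than in the output.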
From talking to @petrelharp about this, it appears that some or most of the RAM use may be attributable to memoization during the calculation, which (he feels) may not be necessary.
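As a possible stopgap, the pairs could be passed to `ts.divergence` in batches, so each call has a smaller output dimension. This is a sketch, not a tested fix: `batched_pairs`, `pairwise_distance_branch_batched` and `batch_size` are names made up here, and it assumes peak memory scales with the number of index pairs per call, which I have not measured.

```python
from itertools import combinations, islice

import numpy as np


def batched_pairs(n, batch_size):
    """Yield lists of at most batch_size index pairs (i, j) with i < j < n."""
    pairs = combinations(range(n), 2)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


def pairwise_distance_branch_batched(ts, samples, batch_size=10_000):
    # One single-sample set per sample, as in the example above.
    sample_sets = [[u] for u in samples]
    out = []
    for batch in batched_pairs(len(samples), batch_size):
        # Each call only asks for len(batch) output values.
        out.append(ts.divergence(sample_sets, indexes=batch, mode="branch"))
    # Pairs come out in the same (i, j) order as the nested loops above.
    return np.concatenate(out) if out else np.empty(0)
```

Usage would be `div = pairwise_distance_branch_batched(ts, ts.samples())`. The cost is that every batch runs the statistic over the tree sequence again, so a smaller `batch_size` trades runtime for memory.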