Low level pair coalescence counts
Low level extension of #2915
Codecov Report
Attention: Patch coverage is 95.28302% with 15 lines in your changes missing coverage. Please review.
Project coverage is 89.65%. Comparing base (beafeba) to head (ca06cff).
| Files | Patch % | Lines |
|---|---|---|
| c/tskit/trees.c | 93.50% | 5 Missing and 8 partials :warning: |
| python/_tskitmodule.c | 98.09% | 1 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2932 +/- ##
==========================================
+ Coverage 89.62% 89.65% +0.02%
==========================================
Files 29 29
Lines 30170 30393 +223
Branches 5867 5901 +34
==========================================
+ Hits 27041 27249 +208
- Misses 1790 1796 +6
- Partials 1339 1348 +9
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.31% <93.68%> (+0.10%) | :arrow_up: |
| lwt-tests | 80.78% <ø> (ø) | |
| python-c-tests | 88.86% <98.09%> (+0.13%) | :arrow_up: |
| python-tests | 99.00% <100.00%> (-0.02%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Files | Coverage Δ | |
|---|---|---|
| c/tskit/core.c | 95.63% <100.00%> (+0.03%) | :arrow_up: |
| c/tskit/core.h | 100.00% <ø> (ø) | |
| python/tskit/trees.py | 98.75% <100.00%> (-0.05%) | :arrow_down: |
| python/_tskitmodule.c | 88.86% <98.09%> (+0.13%) | :arrow_up: |
| c/tskit/trees.c | 90.55% <93.50%> (+0.13%) | :arrow_up: |
I'd like to generalize this algorithm slightly:
- Currently, the output is either a `num_windows` by `num_nodes` array (which is very large), or a `num_windows` by `num_time_windows` array where the counts are summed within time windows.
- Conceptually, the "nodes" output gives the empirical distribution of pair coalescence times in windows across the genome. That is, for each window we have a vector of RVs (node times) and a vector of weights (pair coalescence counts).
- From this distributional viewpoint, there are lots of useful things that can be calculated: the empirical CDF, quantiles, moments, coalescence rates, etc. (of which the "sum in time windows" option of the current implementation is one special case).
So, I think it'd be useful to take the current algorithm, and have it apply a summary function at the end of each window. This would let us calculate any useful summary statistic without having to create a potentially humongous array (windows by nodes by indexes) as an intermediate.
The API would stay the same-- later on, we could add named methods for various summary statistics, and potentially eventually expose a "general summary stat" interface, like is done for the other statistics.
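To make the distributional viewpoint concrete, here is a minimal numpy sketch of the kind of per-window summary such a reduction could compute: a weighted quantile of pair coalescence times read off the empirical CDF. The names `nodes_time` and `nodes_weight` are illustrative only, not part of the existing tskit API.

```python
import numpy as np

def empirical_quantile(nodes_time, nodes_weight, q=0.5):
    """Weighted quantile of pair coalescence times for a single window."""
    order = np.argsort(nodes_time)
    times = nodes_time[order]
    weights = nodes_weight[order]
    ecdf = np.cumsum(weights) / np.sum(weights)  # empirical CDF at each node time
    return times[np.searchsorted(ecdf, q)]

# toy example: three nodes with made-up pair coalescence counts
nodes_time = np.array([10.0, 25.0, 100.0])
nodes_weight = np.array([3.0, 5.0, 2.0])
print(empirical_quantile(nodes_time, nodes_weight, q=0.5))  # -> 25.0
```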
The docs build is failing with numpy 2 issues on import msprime, after a rebase-- is there something I can do to fix this?
This is probably a package cache issue @benjeffery?
I bumped the cache version to no avail-- however, it looks like requirements/CI-docs/requirements.txt has msprime pinned to 1.3.1?
Yup, pinning msprime to the latest version in requirements/* seems to have done the trick.
Alright, I think this is ready for a look @petrelharp and @jeromekelleher. The prototypes for TreeSequence methods are prefixed by proto in the tests. Currently, the workhorse is _pair_coalescence_stat which applies a reduction function summary_func(nodes_weight, nodes_time, **summary_func_kwargs) at the end of each window.
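For illustration, a reduction with that signature could look like the sketch below, which reproduces the "sum counts within time windows" special case. The function body and the `breaks` keyword argument are made up here for the example; they are not the actual prototype.

```python
import numpy as np

def sum_in_time_windows(nodes_weight, nodes_time, breaks=None):
    """Illustrative reduction: sum pair coalescence weights within time bins."""
    # map each node's time to a bin defined by `breaks`, then accumulate its weight
    bin_index = np.digitize(nodes_time, breaks) - 1
    out = np.zeros(len(breaks) - 1)
    np.add.at(out, bin_index, nodes_weight)
    return out

# toy usage with made-up node times/weights and time breakpoints
nodes_time = np.array([5.0, 12.0, 40.0])
nodes_weight = np.array([2.0, 1.0, 3.0])
breaks = np.array([0.0, 10.0, np.inf])
print(sum_in_time_windows(nodes_weight, nodes_time, breaks=breaks))  # [2. 4.]
```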
I think the next steps are:
1. Implement `_pair_coalescence_stat` in C, and allow it to take either (A) a custom reduction function that is a python callback, or (B) a C function that also takes in a struct with extra arguments (for the kwargs).
2. Implement the reduction functions needed to get coalescence rates in C.
3. Make a python API + tests for coalescence rates / counts (already mostly done).
4. Make a python API for custom reduction functions (lower priority).
Does this seem about right? Alternatively I suppose we could have the reduction functions for (2) be python callbacks? But there might be a performance penalty with lots of windows.
Hmm, tests are passing on Ubuntu but evidently not on MacOS or Windows. ssh'ing onto the runner, I can get
malloc: *** error for object 0x7fe5ea402a90: pointer being freed was not allocated
on MacOS, happening in one of the tsk_safe_free calls.
I would strongly advise writing some simple C tests that exercise all reachable code paths, and running it through valgrind. This is a much more effective way of tracking down these kinds of bugs. I can help with this, if it's not clear how to start?
Makes sense @jeromekelleher, I can take a crack at it. Just so I'm understanding the scope, is the goal to recreate the entire test suite from python in c/tests (correctness tests, etc)? Or something more minimal?
Something minimal. You're just looking to exercise all code paths so that valgrind can see them.
That did the trick ... is there a good way to trigger OOM errors after the mallocs, to get patterns like
if (whatever == NULL) {
ret = TSK_ERR_NO_MEMORY;
goto out;
}
covered? It looks like these aren't covered in the existing tests, so presumably not?
The only substantial coverage gap is in _tskitmodule.c: https://app.codecov.io/gh/tskit-dev/tskit/pull/2932/blob/python/_tskitmodule.c#L9924 where it looks like all execution paths end in the check,
if (TreeSequence_check_state(self) != 0) {
goto out;
}
however, this can't be the only execution path, because otherwise ll_tree_sequence.pair_coalescence_counts wouldn't return anything and none of the python test suite would pass.
Is there something else needed to get the CPython parts covered?
Ah, nevermind-- I see that coverage for CPython is getting read off of test_lowlevel.py
> is there a good way to trigger OOM errors after the mallocs, to get patterns like
We've been ignoring these here, as they are very tricky to trigger. We did it in the kastore C tests by running with a patched malloc which fails after n allocs, but it's a lot of work for a pretty marginal benefit.
Thanks @jeromekelleher! All is clear to me except how to handle the special case where time windows are nodes (which is an important one). I left some comments above describing the issue. Briefly, I'm detecting the "nodes" case via a zero-length time windows array. This can't be replaced by an array of unique times in the tree sequence, because we want to preserve node identity even if its age is not unique, and also the order of nodes in the output array; and both these get lost when we sort nodes into (necessarily unique) time windows.
I suppose an alternative (to using a zero-length time windows array) is to use options to store a flag which tells us we are in nodes mode (like is done for the general stats). What do you think is the cleanest option here?
Getting another random Windows crash in the CI, but valgrind isn't showing any issues. I traced it back to commit 6287b84, which switched from using int to double and switched to using the row access macro.
I really would prefer if we could back out of doing the three different options here and focus on the simpler case of just a fixed set of required time windows first. There's a lot of complexity here already, and I want to make sure we catch all of the boring C level stuff before we have to think about that extra level of trickiness.
So, can we leave that bit for a follow up, and focus on getting 100% test coverage on the bits that should be covered? That means more tests in C, and breaking up the Python C tests like I suggested.
So I should pytest.mark.skip any tests using the "nodes" output for the time being? These are the bulk of the tests, which is why I tried to support that mode.
I'm working on extending the test_lowlevel case to get the CPython covered, but I think all the C code paths are now covered aside from OOM.
Here's memory consumption over time:
looks to me like it's stabilizing -- is this what you were after @jeromekelleher ?
Code here:
monitor mem usage with ps
#!/usr/bin/env bash
echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
1000, sequence_length=1e5, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.array([0, np.max(ts.nodes_time)/2, np.inf])
for _ in range(100000):
ts.pair_coalescence_counts(time_windows=time_windows)
' >del_memory_monitor.py
rm -f del_memory_monitor.log
python del_memory_monitor.py &
for i in {1..1000}; do
ps -eo cmd,%mem,vsz,rsz | grep "python del_memory_monitor" >> del_memory_monitor.log
sleep 0.0001
done
rm del_memory_monitor.py
echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("del_memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.0001
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage / max(memory usage)")
plt.xlabel("Wall time (relative)")
plt.xscale("log")
plt.legend()
plt.savefig("del_memory_monitor.png")
' | python
rm del_memory_monitor.log
Nice - did you try it with some decrefs commented out?
this is with results_array XDECREF commented out. Looks qualitatively similar ... ?
Hmm, you need bigger examples then. You should be leaking memory heavily there.
Ah, the output array was too small with only 2 time windows.
Here's with XDECREF commented out:
Here's with XDECREF:
looks sane to me. I'm waiting a bit before starting monitoring, here, to skip the initial overhead associated with imports, etc.
memory monitoring code for real
#!/usr/bin/env bash
echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
100, sequence_length=1e6, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.append(np.unique(ts.nodes_time), np.inf)
windows = np.linspace(0, ts.sequence_length, 100)
sample_sets = list(ts.samples())
sample_set_sizes = [ts.num_samples]
indexes = [(0, 0)]
for _ in range(100000):
ts.ll_tree_sequence.pair_coalescence_counts(
time_windows=time_windows,
windows=windows,
indexes=indexes,
sample_sets=sample_sets,
sample_set_sizes=sample_set_sizes,
span_normalise=False,
)
' >memory_monitor.py
python memory_monitor.py &
sleep 1
for i in {1..1000}; do
ps -eo cmd,%mem,vsz,rsz | grep "python memory_monitor"
sleep 0.1
done >memory_monitor.log
echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.1
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage (divided by max)")
plt.xlabel("Wall time (relative to start)")
plt.legend()
plt.savefig("memory_monitor.png")
' | python
Maybe the thing to do with time windows is change the input to ts.ll_tree_sequence.pair_coalescence_counts from time_windows to an integer vector node_bin that maps nodes to output bins (which could be time bins, or any other delineation). Then ts.pair_coalescence_counts can internally do the binning for a given set of time breakpoints, which allows all three cases of time windows (nodes, unique time points, time intervals) to be passed to the low-level routine.
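A rough numpy sketch of that mapping, with a hypothetical helper name rather than the final API: the high-level method would translate time breakpoints into per-node bin indices, and the three cases then just correspond to different choices of bins.

```python
import numpy as np

def node_bins_from_breaks(nodes_time, breaks):
    """Map each node to an output bin given time breakpoints (hypothetical helper)."""
    return np.digitize(nodes_time, breaks) - 1

nodes_time = np.array([0.0, 3.0, 7.5, 20.0])

# "time intervals" case: bin nodes by time breakpoints
breaks = np.array([0.0, 5.0, np.inf])
print(node_bins_from_breaks(nodes_time, breaks))  # [0 0 1 1]

# "unique time points" case: bin by the unique node times themselves
print(node_bins_from_breaks(nodes_time, np.unique(nodes_time)))  # [0 1 2 3]

# "nodes" case: every node gets its own output bin
node_bin = np.arange(nodes_time.size)
```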
Nice, that sounds like a good idea. Do you want to push forward with that, or merge this much and iterate? (I haven't reviewed again, will need to take another good look)
I gave it a shot, but it looks like I'm getting stochastic crashes on Windows. Totally perplexed as to why, because the semantics are very similar to the last commit, which worked.
(It could be because I had skipped a bunch of tests in the last commit?)
Working now-- bad malloc (d'oh). This is ready for another look whenever you have time @jeromekelleher .
Here's mem usage once more, with the time_windows -> nodes_output_map change:
Done!
It occurs to me that the order of dimensions here -- (windows, time_windows, indexes) -- should be consistent with the (forthcoming) time windowed stats. @petrelharp what order did you guys settle on?
(windows, time_windows, indexes)
Same!