Low level pair coalescence counts
Low level extension of #2915
Codecov Report
Attention: Patch coverage is 95.28302% with 15 lines in your changes missing coverage. Please review.
Project coverage is 89.65%. Comparing base (beafeba) to head (ca06cff).
| Files | Patch % | Lines |
|---|---|---|
| c/tskit/trees.c | 93.50% | 5 Missing and 8 partials :warning: |
| python/_tskitmodule.c | 98.09% | 1 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2932 +/- ##
==========================================
+ Coverage 89.62% 89.65% +0.02%
==========================================
Files 29 29
Lines 30170 30393 +223
Branches 5867 5901 +34
==========================================
+ Hits 27041 27249 +208
- Misses 1790 1796 +6
- Partials 1339 1348 +9
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.31% <93.68%> (+0.10%) | :arrow_up: |
| lwt-tests | 80.78% <ø> (ø) | |
| python-c-tests | 88.86% <98.09%> (+0.13%) | :arrow_up: |
| python-tests | 99.00% <100.00%> (-0.02%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Files | Coverage Δ | |
|---|---|---|
| c/tskit/core.c | 95.63% <100.00%> (+0.03%) | :arrow_up: |
| c/tskit/core.h | 100.00% <ø> (ø) | |
| python/tskit/trees.py | 98.75% <100.00%> (-0.05%) | :arrow_down: |
| python/_tskitmodule.c | 88.86% <98.09%> (+0.13%) | :arrow_up: |
| c/tskit/trees.c | 90.55% <93.50%> (+0.13%) | :arrow_up: |
I'd like to generalize this algorithm slightly:
- Currently, the output is either a `num_windows` by `num_nodes` array (which is very large), or a `num_windows` by `num_time_windows` array where the counts are summed within time windows.
- Conceptually, the "nodes" output gives the empirical distribution of pair coalescence times in windows across the genome. That is, for each window we have a vector of RVs (node times) and a vector of weights (pair coalescence counts).
- From this distributional viewpoint, there are lots of useful things that can be calculated: the empirical CDF, quantiles, moments, coalescence rates, etc. (of which the "sum in time windows" option of the current implementation is one special case).
So, I think it'd be useful to take the current algorithm, and have it apply a summary function at the end of each window. This would let us calculate any useful summary statistic without having to create a potentially humongous array (windows by nodes by indexes) as an intermediate.
The API would stay the same-- later on, we could add named methods for various summary statistics, and potentially eventually expose a "general summary stat" interface, like is done for the other statistics.
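To make the distributional viewpoint concrete, here is a minimal numpy sketch of the kind of per-window summary such a reduction could compute: a weighted quantile of pair coalescence times read off the empirical CDF. The names `nodes_time` and `nodes_weight` are illustrative only, not part of the existing tskit API.

```python
import numpy as np

def empirical_quantile(nodes_time, nodes_weight, q=0.5):
    """Weighted quantile of pair coalescence times for a single window."""
    order = np.argsort(nodes_time)
    times = nodes_time[order]
    weights = nodes_weight[order]
    ecdf = np.cumsum(weights) / np.sum(weights)  # empirical CDF at each node time
    return times[np.searchsorted(ecdf, q)]

# toy example: three nodes with made-up pair coalescence counts
nodes_time = np.array([10.0, 25.0, 100.0])
nodes_weight = np.array([3.0, 5.0, 2.0])
print(empirical_quantile(nodes_time, nodes_weight, q=0.5))  # -> 25.0
```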
The docs build is failing with numpy 2 issues on import msprime, after a rebase-- is there something I can do to fix this?
This is probably a package cache issue @benjeffery?
I bumped the cache version to no avail-- however, it looks like requirements/CI-docs/requirements.txt has msprime pinned to 1.3.1?
Yup, pinning msprime to the latest version in requirements/* seems to have done the trick.
Alright, I think this is ready for a look @petrelharp and @jeromekelleher. The prototypes for TreeSequence methods are prefixed by proto in the tests. Currently, the workhorse is _pair_coalescence_stat which applies a reduction function summary_func(nodes_weight, nodes_time, **summary_func_kwargs) at the end of each window.
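For illustration, a reduction with that signature could look like the sketch below, which reproduces the "sum counts within time windows" special case. The function body and the `breaks` keyword argument are made up here for the example; they are not the actual prototype.

```python
import numpy as np

def sum_in_time_windows(nodes_weight, nodes_time, breaks=None):
    """Illustrative reduction: sum pair coalescence weights within time bins."""
    # map each node's time to a bin defined by `breaks`, then accumulate its weight
    bin_index = np.digitize(nodes_time, breaks) - 1
    out = np.zeros(len(breaks) - 1)
    np.add.at(out, bin_index, nodes_weight)
    return out

# toy usage with made-up node times/weights and time breakpoints
nodes_time = np.array([5.0, 12.0, 40.0])
nodes_weight = np.array([2.0, 1.0, 3.0])
breaks = np.array([0.0, 10.0, np.inf])
print(sum_in_time_windows(nodes_weight, nodes_time, breaks=breaks))  # [2. 4.]
```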
I think the next steps are:
1. Implement `_pair_coalescence_stat` in C, and allow it to take either (A) a custom reduction function that is a python callback, or (B) a C function that also takes in a struct with extra arguments (for the kwargs).
2. Implement the reduction functions needed to get coalescence rates in C.
3. Make a python API + tests for coalescence rates / counts (already mostly done).
4. Make a python API for custom reduction functions (lower priority).
Does this seem about right? Alternatively I suppose we could have the reduction functions for (2) be python callbacks? But there might be a performance penalty with lots of windows.
Hmm, tests are passing on Ubuntu but evidently not on MacOS or Windows. ssh'ing onto the runner, I can get
malloc: *** error for object 0x7fe5ea402a90: pointer being freed was not allocated
on MacOS, happening in one of the tsk_safe_free calls.
I would strongly advise writing some simple C tests that exercise all reachable code paths, and running it through valgrind. This is a much more effective way of tracking down these kinds of bugs. I can help with this, if it's not clear how to start?
Makes sense @jeromekelleher, I can take a crack at it. Just so I'm understanding the scope, is the goal to recreate the entire test suite from python in c/tests (correctness tests, etc)? Or something more minimal?
Something minimal. You're just looking to exercise all code paths so that valgrind can see them.
That did the trick ... is there a good way to trigger OOM errors after the mallocs, to get patterns like
if (whatever == NULL) {
ret = TSK_ERR_NO_MEMORY;
goto out;
}
covered? It looks like these aren't covered in the existing tests, so presumably not?
The only substantial coverage gap is in _tskitmodule.c: https://app.codecov.io/gh/tskit-dev/tskit/pull/2932/blob/python/_tskitmodule.c#L9924 where it looks like all execution paths end in the check,
if (TreeSequence_check_state(self) != 0) {
goto out;
}
however, this can't be the only execution path, because otherwise ll_tree_sequence.pair_coalescence_counts wouldn't return anything and none of the python test suite would pass.
Is there something else needed to get the CPython parts covered?
Ah, nevermind-- I see that coverage for CPython is getting read off of test_lowlevel.py
> is there a good way to trigger OOM errors after the mallocs, to get patterns like
We've been ignoring these here, as they are very tricky to trigger. We did it in the kastore C tests by running with a patched malloc which fails after n allocs, but it's a lot of work for a pretty marginal benefit.
Thanks @jeromekelleher! All is clear to me except how to handle the special case where time windows are nodes (which is an important one). I left some comments above describing the issue. Briefly, I'm detecting the "nodes" case via a zero-length time windows array. This can't be replaced by an array of unique times in the tree sequence, because we want to preserve node identity even if its age is not unique, and also the order of nodes in the output array; and both these get lost when we sort nodes into (necessarily unique) time windows.
I suppose an alternative (to using a zero-length time windows array) is to use options to store a flag which tells us we are in nodes mode (like is done for the general stats). What do you think is the cleanest option here?
Getting another random Windows crash in the CI, but valgrind isn't showing any issues. I traced it back to commit 6287b84, which switched from using int to double and switched to using the row access macro.
I really would prefer if we could back out of doing the three different options here and focus on the simpler case of just a fixed set of required time windows first. There's a lot of complexity here already, and I want to make sure we catch all of the boring C level stuff before we have to think about that extra level of trickiness.
So, can we leave that bit for a follow up, and focus on getting 100% test coverage on the bits that should be covered? That means more tests in C, and breaking up the Python C tests like I suggested.
So I should pytest.mark.skip any tests using the "nodes" output for the time being? These are the bulk of the tests, which is why I tried to support that mode.
I'm working on extending the test_lowlevel case to get the CPython covered, but I think all the C code paths are now covered aside from OOM.
Here's memory consumption over time:
looks to me like it's stabilizing -- is this what you were after @jeromekelleher ?
Code here:
monitor mem usage with ps
#!/usr/bin/env bash
echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
1000, sequence_length=1e5, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.array([0, np.max(ts.nodes_time)/2, np.inf])
for _ in range(100000):
ts.pair_coalescence_counts(time_windows=time_windows)
' >del_memory_monitor.py
rm -f del_memory_monitor.log
python del_memory_monitor.py &
for i in {1..1000}; do
ps -eo cmd,%mem,vsz,rsz | grep "python del_memory_monitor" >> del_memory_monitor.log
sleep 0.0001
done
rm del_memory_monitor.py
echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("del_memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.0001
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage / max(memory usage)")
plt.xlabel("Wall time (relative)")
plt.xscale("log")
plt.legend()
plt.savefig("del_memory_monitor.png")
' | python
rm del_memory_monitor.log
Nice - did you try it with some decrefs commented out?
this is with results_array XDECREF commented out. Looks qualitatively similar ... ?
Hmm, you need bigger examples then. You should be leaking memory heavily there.
Ah, the output array was too small with only 2 time windows.
Here's with XDECREF commented out:
Here's with XDECREF:
looks sane to me. I'm waiting a bit before starting monitoring, here, to skip the initial overhead associated with imports, etc.
memory monitoring code for real
#!/usr/bin/env bash
echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
100, sequence_length=1e6, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.append(np.unique(ts.nodes_time), np.inf)
windows = np.linspace(0, ts.sequence_length, 100)
sample_sets = list(ts.samples())
sample_set_sizes = [ts.num_samples]
indexes = [(0, 0)]
for _ in range(100000):
ts.ll_tree_sequence.pair_coalescence_counts(
time_windows=time_windows,
windows=windows,
indexes=indexes,
sample_sets=sample_sets,
sample_set_sizes=sample_set_sizes,
span_normalise=False,
)
' >memory_monitor.py
python memory_monitor.py &
sleep 1
for i in {1..1000}; do
ps -eo cmd,%mem,vsz,rsz | grep "python memory_monitor"
sleep 0.1
done >memory_monitor.log
echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.1
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage (divided by max)")
plt.xlabel("Wall time (relative to start)")
plt.legend()
plt.savefig("memory_monitor.png")
' | python
Maybe the thing to do with time windows is change the input to ts.ll_tree_sequence.pair_coalescence_counts from time_windows to an integer vector node_bin that maps nodes to output bins (which could be time bins, or any other delineation). Then ts.pair_coalescence_counts can internally do the binning for a given set of time breakpoints, which allows all three cases of time windows (nodes, unique time points, time intervals) to be passed to the low-level routine.
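A rough numpy sketch of that mapping, with a hypothetical helper name rather than the final API: the high-level method would translate time breakpoints into per-node bin indices, and the three cases then just correspond to different choices of bins.

```python
import numpy as np

def node_bins_from_breaks(nodes_time, breaks):
    """Map each node to an output bin given time breakpoints (hypothetical helper)."""
    return np.digitize(nodes_time, breaks) - 1

nodes_time = np.array([0.0, 3.0, 7.5, 20.0])

# "time intervals" case: bin nodes by time breakpoints
breaks = np.array([0.0, 5.0, np.inf])
print(node_bins_from_breaks(nodes_time, breaks))  # [0 0 1 1]

# "unique time points" case: bin by the unique node times themselves
print(node_bins_from_breaks(nodes_time, np.unique(nodes_time)))  # [0 1 2 3]

# "nodes" case: every node gets its own output bin
node_bin = np.arange(nodes_time.size)
```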
Nice, that sounds like a good idea. Do you want to push forward with that, or merge this much and iterate? (I haven't reviewed again, will need to take another good look)
I gave it a shot, but it looks like I'm getting stochastic crashes on Windows. Totally perplexed as to why, because the semantics are very similar to the last commit, which worked.
(It could be because I had skipped a bunch of tests in the last commit?)
Working now-- bad malloc (d'oh). This is ready for another look whenever you have time @jeromekelleher .
Here's mem usage once more, with the time_windows -> nodes_output_map change:
Done!
It occurs to me that the order of dimensions here -- (windows, time_windows, indexes) -- should be consistent with the (forthcoming) time windowed stats. @petrelharp what order did you guys settle on?
(windows, time_windows, indexes)
Same!