time windows in statistics
Here @tforest and I are starting in on adding time windows to statistics. We're starting with what was sketched out in #683, and will explain things in more detail here when we're farther along (ignore this for now).
Codecov Report
Attention: Patch coverage is 88.57143% with 12 lines in your changes missing coverage. Please review.
Project coverage is 89.83%. Comparing base (16de381) to head (0d48891). Report is 25 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/tskit/trees.py | 75.00% | 5 Missing and 5 partials :warning: |
| c/tskit/trees.c | 96.00% | 0 Missing and 2 partials :warning: |
Additional details and impacted files
Coverage Diff
| | main | #2948 | +/- |
|---|---|---|---|
| Coverage | 89.85% | 89.83% | -0.03% |
| Files | 29 | 29 | |
| Lines | 32128 | 32222 | +94 |
| Branches | 5763 | 5784 | +21 |
| Hits | 28868 | 28946 | +78 |
| Misses | 1859 | 1868 | +9 |
| Partials | 1401 | 1408 | +7 |
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.71% <96.07%> (+0.01%) | :arrow_up: |
| lwt-tests | 80.78% <ø> (ø) | |
| python-c-tests | 89.06% <100.00%> (+<0.01%) | :arrow_up: |
| python-tests | 98.80% <75.00%> (-0.18%) | :arrow_down: |

Flags with carried forward coverage won't be shown.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| c/tskit/core.c | 95.83% <100.00%> (ø) | |
| python/_tskitmodule.c | 89.06% <100.00%> (+<0.01%) | :arrow_up: |
| c/tskit/trees.c | 90.70% <96.00%> (+0.02%) | :arrow_up: |
| python/tskit/trees.py | 98.24% <75.00%> (-0.57%) | :arrow_down: |
Note: it is not clear how to do this for site statistics, since the site stat is of the form
$$\sum_a f(w_a)$$
where the sum is over alleles, and $w_a$ is the weight of all samples with allele $a$;
however, it is mutations that have times, not alleles.
The proposal will probably be to compute a site stat that sums over mutations, not alleles, but we'll start with branch stats only for now.
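To make the distinction concrete, here is a toy sketch in plain Python (not tskit code). The allele form is the one written above; the mutation form is only a guess at what the proposal might look like. In particular, the per-mutation weight `w`, the half-open interval `[t_lo, t_hi)`, and both function names are assumptions made for illustration.

```python
import math


def allele_stat(allele_weights, f):
    """Current form: sum of f(w_a) over the alleles at one site."""
    return sum(f(w) for w in allele_weights.values())


def mutation_stat_in_window(mutations, f, t_lo, t_hi):
    """Hypothetical form: sum of f(w_m) over mutations whose time
    falls in [t_lo, t_hi). Each mutation is a (time, weight) pair;
    what the weight of a mutation should be is an open question here."""
    return sum(f(w) for t, w in mutations if t_lo <= t < t_hi)


n = 4  # number of samples


def f(w):
    # A diversity-like summary function, chosen only as an example.
    return w * (n - w)


# Alleles have weights but no times, so they cannot be time-windowed.
total = allele_stat({"A": 3, "T": 1}, f)

# Mutations carry times, so each one can be assigned to a time window.
muts = [(2.5, 1), (0.5, 2)]
recent = mutation_stat_in_window(muts, f, 0.0, 1.0)
ancient = mutation_stat_in_window(muts, f, 1.0, math.inf)
```

The point of the sketch is just that the second function has a time to filter on and the first does not.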
Next step:
- do the AFS first, since it's less tangled up
Also maybe:
- allow `ts.decapitate()` to take `inf` as an argument (that does nothing)?
A small nudge here that I mentioned to @petrelharp in passing: it would be great to have an expectation from theory as to what time-stratified quantities like the SFS should be under the (standard, neutral) coalescent.
Some thoughts after working on time windows.
After these edits, the output of, say, the AFS is still a 2D array when either `windows` or `time_windows` is used on its own. However, when `windows` and `time_windows` are used together, the output is a 3D array with shape `[num_windows][num_time_windows][sample_size]`. When `windows` or `time_windows` is None, the associated dimension is dropped. As there are now two types of windows, it becomes ambiguous that the historical `windows` parameter refers specifically to windows along the genome. We have not renamed it for now, though, as that would break existing behavior.
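A small sketch of the shape rule described above, in plain Python. The helper `afs_output_shape` is hypothetical and is not part of tskit. It assumes that both `windows` and `time_windows` are given as lists of breakpoints, so that `k` breakpoints define `k - 1` windows, and it treats the length of the last axis as given.

```python
def afs_output_shape(windows, time_windows, last_dim):
    """Expected output shape under the rule above.

    windows, time_windows: lists of breakpoints, or None.
    last_dim: length of the final axis (the "sample_size" axis above).
    """
    shape = []
    if windows is not None:
        shape.append(len(windows) - 1)  # genomic windows axis
    if time_windows is not None:
        shape.append(len(time_windows) - 1)  # time windows axis
    shape.append(last_dim)
    return tuple(shape)


inf = float("inf")
both = afs_output_shape([0, 50, 100], [0, 1, 10, inf], 5)
only_genomic = afs_output_shape([0, 50, 100], None, 5)
only_time = afs_output_shape(None, [0, 1, 10, inf], 5)
neither = afs_output_shape(None, None, 5)
```

Note that `only_genomic` and `only_time` are both 2D, which is why the two cases cannot be told apart from the output shape alone.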
Some ideas:
- Add new benchmarks for summary stats to see if the implemented features are optimized both in terms of computational space and time complexity.
- Add some plots for summary stats to observe how time windows impact them.
A note on the potential confusion between windows and time_windows - often one endpoint of the time_windows will be Inf, so if we make sure we produce an informative error if the windows aren't finite, we'll help people avoid the mistake.
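Something like the following check would do it. This is only a sketch in plain Python: `check_genomic_windows` is a hypothetical helper, not the validation tskit actually performs, and the error wording is made up.

```python
import math


def check_genomic_windows(windows):
    """Raise an informative error if any genomic breakpoint is not finite.

    An infinite endpoint is a strong hint that time windows were
    passed as `windows` by mistake.
    """
    for w in windows:
        if not math.isfinite(w):
            raise ValueError(
                f"Genomic window breakpoint {w} is not finite; "
                "did you mean to pass these as time_windows?"
            )


check_genomic_windows([0, 50, 100])  # fine, no error

try:
    check_genomic_windows([0, 10, math.inf])
    message = None
except ValueError as e:
    message = str(e)
```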
I've added this work to the next release milestone. Hoping to get a release out in a week or two, if that is too ambitious for this let me know.
Probably too ambitious, but we might have something in by then.