tskit
Fix genetic_relatedness to allow single sample set with self-comparisons
Description
This PR fixes issue #3055, where genetic_relatedness would fail when computing self-comparisons with a single sample set. As @petrelharp pointed out, there is no reason this shouldn't be allowed for certain statistics, including genetic_relatedness.
The Problem
Previously, calling genetic_relatedness with a single sample set and indexes referring to that set would raise an error:
```python
ts.genetic_relatedness([[0]], indexes=[(0, 0)])
# TSK_ERR_INSUFFICIENT_SAMPLE_SETS: Insufficient sample sets provided
```
This occurred because the underlying C API validates that at least k=2 sample sets are provided for k-way statistics, even when the indexes only reference a subset of those sets.
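As an illustrative sketch (in Python, not the actual tskit C code), the old validation amounted to something like the following, where a k-way statistic demanded at least k sample sets regardless of which sets the indexes actually referenced. The function name `check_k_way` is hypothetical:

```python
# Hypothetical sketch of the pre-fix validation, not tskit internals.
def check_k_way(sample_sets, indexes, k):
    # Require at least k sample sets for a k-way statistic, even if
    # the index tuples only ever reference a subset of them.
    if len(sample_sets) < k:
        raise ValueError("Insufficient sample sets provided")
    for tup in indexes:
        if len(tup) != k:
            raise ValueError("Index tuples must have length k")

# A single sample set with a self-comparison index fails this check,
# even though the tuple (0, 0) only references set 0:
try:
    check_k_way([[0]], [(0, 0)], k=2)
except ValueError as e:
    print(e)  # Insufficient sample sets provided
```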
This PR
The fix adds an allow_self_comparisons parameter to the internal __k_way_sample_set_stat method, which is only set to True for genetic_relatedness. When enabled:
- The method validates that all indexes reference valid sample sets
- If fewer than k sample sets are provided, it pads the list with dummy sample sets to satisfy the C API requirement
- The dummy sets are never accessed during computation since only the sets referenced by the indexes are used
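The padding step described above can be sketched in pure Python as follows. This is a hypothetical helper (`pad_sample_sets` is not the actual tskit internal name), showing only the idea: validate the indexes, then pad with dummy sets that the computation never reads:

```python
# Hypothetical sketch of the padding approach, not tskit internals.
def pad_sample_sets(sample_sets, indexes, k):
    # Validate that every index references a provided sample set.
    for tup in indexes:
        for i in tup:
            if not 0 <= i < len(sample_sets):
                raise ValueError(f"Index {i} out of bounds")
    padded = list(sample_sets)
    while len(padded) < k:
        # Dummy sets are never accessed: only sets referenced by the
        # indexes are used during computation.
        padded.append(sample_sets[0])
    return padded

print(pad_sample_sets([[0]], [(0, 0)], k=2))  # [[0], [0]]
```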
This ensures that:
- Only genetic_relatedness behavior is changed
- Other statistics (like Fst) continue to enforce the minimum sample set requirement
- The C API contract is satisfied without modifying the C code
Further, this PR sets the stage for allowing other appropriate k-way stats to be computed with self-comparisons by setting a flag.
Testing
Added comprehensive tests for all three computation modes (site, branch, node) to verify:
- Single sample set with self-comparison works correctly
- Multiple samples within a single set work correctly
- The fix doesn't affect other statistics
All existing tests continue to pass. @petrelharp, this would benefit from your review in particular, to be sure you're okay with the way I've implemented allow_self_comparisons.
Fixes #3055
PR Checklist:
- [x] Tests that fully cover new/changed functionality.
- [x] Documentation including tutorial content if appropriate.
- [ ] Changelogs, if there are API changes.
I'm not sure if I should touch the changelogs here?
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 89.75%. Comparing base (1d8f453) to head (c173b32).
:warning: Report is 1 commit behind head on main.
Additional details and impacted files
```text
@@           Coverage Diff           @@
##             main    #3235   +/-   ##
=======================================
  Coverage   89.75%   89.75%
=======================================
  Files          29       29
  Lines       32545    32545
  Branches     5962     5962
=======================================
  Hits        29211    29211
  Misses       1890     1890
  Partials     1444     1444
```
| Flag | Coverage Δ |
|---|---|
| c-tests | 86.76% <100.00%> (ø) |
| lwt-tests | 80.38% <ø> (ø) |
| python-c-tests | 88.24% <ø> (ø) |
| python-tests | 98.79% <ø> (ø) |
| python-tests-no-jit | 33.02% <ø> (ø) |
| python-tests-numpy1 | 50.71% <ø> (ø) |
Flags with carried forward coverage won't be shown.
| Files with missing lines | Coverage Δ |
|---|---|
| c/tskit/trees.c | 90.81% <100.00%> (ø) |
I think we'll need to wait for @petrelharp to get informed review on this one.
Bumping this one for you @petrelharp
Ignore my previous comment (now deleted); I was being dense.
Let's see. First, it's quite easy to modify the checking code that produces the error. I went ahead and did that (see this commit: 57ad49f7046885).
However, one can compute 'self' comparisons for other statistics; for instance, this works:
```python
ts.divergence([[0], [1]], indexes=[(0, 0)])
```
or this:
```python
ts.f4([[0], [1], [2], [3]], indexes=[(0, 0, 0, 0)])
```
These are perfectly valid things to compute; the thought was just that they're probably not telling someone what they actually want to know. So, if the intention behind the restriction of "at least k sample sets for a k-way stat" was to keep people from computing things they probably didn't want to compute, then we're not actually achieving that.

We could require indexes to be unique, but I don't think we should actually do that: for instance, f4(..., indexes=[(0,0,1,1)]) is actually a version of f2( ) that people compute sometimes (I think, or maybe it's (0, 1, 0, 1)). So, in the next commit I've just removed the requirement that there be at least k sample sets, instead requiring that there be at least 1 sample set in all cases.
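The relaxed rule described here can be sketched in pure Python as follows. This is illustrative only (the function name `check_relaxed` is hypothetical, not the committed C code): require at least one sample set and in-range index tuples of the right length, rather than at least k sample sets for a k-way statistic:

```python
# Illustrative sketch of the relaxed check, not the actual C code.
def check_relaxed(sample_sets, indexes, k):
    # At least 1 sample set in all cases, not at least k.
    if len(sample_sets) < 1:
        raise ValueError("At least one sample set is required")
    for tup in indexes:
        if len(tup) != k:
            raise ValueError("Index tuples must have length k")
        for i in tup:
            if not 0 <= i < len(sample_sets):
                raise ValueError("Index out of bounds")

# Repeated indexes remain legal, e.g. an f4-style self-comparison:
check_relaxed([[0], [1]], [(0, 0, 1, 1)], k=4)  # passes
```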
How's that sound, @andrewkern?
LMK how this looks @andrewkern; if you like it I'll squash these commits down.
Rebased; added to CHANGELOG; tests are (should be) passing. Ready to merge when you approve, @andrewkern .
Looks good to me @petrelharp -- a couple of nitpicks about adding comments to help explain the code to future us.
See my comments & changes. Could you "approve" the code in review if you're good with it now? (Or let me know if not?)