LD prune test is failing due to differences with scikit-allel
Hypothesis is finding cases where sgkit and scikit-allel differ.
For example:
=================================== FAILURES ===================================
_______________________________ test_vs_skallel ________________________________
@given(args=ld_prune_args()) # pylint: disable=no-value-for-parameter
> @settings(max_examples=50, deadline=None, phases=PHASES_NO_SHRINK)
sgkit/tests/test_ld.py:158:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0... 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]], dtype=uint8), 27, 2, 0.0, 19)
@given(args=ld_prune_args()) # pylint: disable=no-value-for-parameter
@settings(max_examples=50, deadline=None, phases=PHASES_NO_SHRINK)
@example(args=(np.array([[1, 1], [1, 1]], dtype="uint8"), 1, 1, 0.0, -1))
def test_vs_skallel(args):
x, size, step, threshold, chunks = args
ds = simulate_genotype_call_dataset(n_variant=x.shape[0], n_sample=x.shape[1])
ds["dosage"] = (["variants", "samples"], da.asarray(x).rechunk({0: chunks}))
ds = window_by_variant(ds, size=size, step=step)
ldm = ld_matrix(ds, threshold=threshold)
has_duplicates = ldm.compute().duplicated(subset=["i", "j"]).any()
assert not has_duplicates
idx_drop_ds = maximal_independent_set(ldm)
idx_drop = np.sort(idx_drop_ds.ld_prune_index_to_drop.data)
m = allel.locate_unlinked(x, size=size, step=step, threshold=threshold)
idx_drop_ska = np.sort(np.argwhere(~m).squeeze(axis=1))
> npt.assert_equal(idx_drop_ska, idx_drop)
E AssertionError:
E Arrays are not equal
E
E (shapes (0,), (74,) mismatch)
E x: array([], dtype=int64)
E y: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
E 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35,
E 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,...
sgkit/tests/test_ld.py:176: AssertionError
---------------------------------- Hypothesis ----------------------------------
Falsifying example: test_vs_skallel(
args=(array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]], dtype=uint8), 27, 2, 0.0, 19),
)
You can reproduce this example by temporarily adding @reproduce_failure('6.47.0', b'AEsCAQEBAgADABoBAQMBAAAAAAAAAAAACQ==') as a decorator on your test case
From https://github.com/pystatgen/sgkit/runs/6813432367?check_suite_focus=true#step:8:1124.
At a cursory glance this looks like it could be a precision problem, which we've had with Rogers-Huff calculations before. I'm not sure why this has only started happening now, though - perhaps a more recent Hypothesis version uses a different search strategy?
Hypothesis is a bit of a mixed blessing in my experience - it does sometimes find important corner cases, but you can spend a lot of time figuring out how to avoid unimportant cases that will never happen in practice.
I'd be happy to skip this test for now, and log the Hypothesis problem as a separate issue.
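To illustrate the kind of precision issue mentioned above (with hypothetical dosage values, not the failing case itself), a correlation-based r² computed in float32 can differ slightly from the float64 result, which matters whenever the value lands near the pruning threshold:

```python
import numpy as np

def r2(g0, g1):
    # Plain correlation-based r**2 between two dosage vectors; a stand-in
    # for the Rogers-Huff estimate, just to show dtype sensitivity.
    cov = ((g0 - g0.mean()) * (g1 - g1.mean())).mean()
    return float((cov / np.sqrt(g0.var() * g1.var())) ** 2)

gn0 = np.array([2, 1, 2, 0], dtype=np.float32)
gn1 = np.array([2, 2, 1, 2], dtype=np.float32)

r2_f32 = r2(gn0, gn1)
r2_f64 = r2(gn0.astype(np.float64), gn1.astype(np.float64))
# The two results agree only to float32 precision; a threshold falling
# inside that gap would make two implementations disagree on pruning.
print(r2_f32, r2_f64)
```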
With a threshold of 0 and all variants but one having the same genotype across all samples, I think sgkit is doing the correct thing here while scikit-allel is not. I'm pretty sure the difference is a result of this discrepancy:
# early out
if n == 0 or v0 == 0 or v1 == 0:
return nan32
https://github.com/pystatgen/sgkit/blob/b99c708040d8ef647599ac6cc9b43df0ccce5f94/sgkit/stats/ld.py#L48-L50
Because of this, skallel is considering the LD to be NaN for each variant pair rather than 1 for all pairs except those involving the sole segregating variant in this test case. This results in skallel eliminating none of the variants while sgkit is eliminating all but 3.
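To make the discrepancy concrete, here is a minimal, hypothetical sketch of a Rogers-Huff style r² with that early-out (not the actual sgkit or scikit-allel implementation):

```python
import numpy as np

def rogers_huff_r2(gn0, gn1):
    # Hypothetical minimal sketch of a Rogers-Huff style r**2 between
    # two dosage vectors, with the scikit-allel style early-out.
    gn0 = np.asarray(gn0, dtype=np.float64)
    gn1 = np.asarray(gn1, dtype=np.float64)
    v0, v1 = gn0.var(), gn1.var()
    if v0 == 0 or v1 == 0:
        # Early-out: LD is treated as NaN when either variant is
        # non-segregating, so a pair of identical constant variants
        # never exceeds any threshold and neither is ever pruned.
        return np.nan
    cov = ((gn0 - gn0.mean()) * (gn1 - gn1.mean())).mean()
    r = cov / np.sqrt(v0 * v1)
    return r * r

# Two non-segregating variants: the early-out yields NaN, whereas
# treating perfectly identical variants as r**2 == 1 would prune one.
print(rogers_huff_r2([0, 0, 0, 0], [0, 0, 0, 0]))  # nan
print(rogers_huff_r2([0, 1, 2, 0], [0, 1, 2, 0]))  # 1.0
```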
avoid unimportant cases that'll just never happen
I think that's mostly true here since the elimination of non-segregating variants is likely to have happened already, or at least it should have in a real-life workflow. My suggestion would be to add that step to the test.
Thanks @eric-czech! I think that explains the error I posted. I tried setting the range of threshold to exclude 0, but I get other failures.
diff --git a/sgkit/tests/test_ld.py b/sgkit/tests/test_ld.py
index e4bce760..f99bac77 100644
--- a/sgkit/tests/test_ld.py
+++ b/sgkit/tests/test_ld.py
@@ -143,7 +143,7 @@ def ld_prune_args(draw):
assert x.ndim == 2
window = draw(st.integers(1, x.shape[0]))
step = draw(st.integers(1, window))
- threshold = draw(st.floats(0, 1))
+ threshold = draw(st.floats(1e-2, 1))
chunks = draw(st.integers(10, window * 3)) if window > 10 else -1
return x, window, step, threshold, chunks
For example:
> npt.assert_equal(idx_drop_ska, idx_drop)
E AssertionError:
E Arrays are not equal
E
E (shapes (3,), (4,) mismatch)
E x: array([10, 12, 13])
E y: array([ 9, 10, 12, 13])
sgkit/tests/test_ld.py:176: AssertionError
--------------------------------------------------------------------------- Hypothesis ----------------------------------------------------------------------------
Falsifying example: test_vs_skallel(
args=(array([[2, 0, 2, 2],
[2, 2, 2, 2],
[2, 2, 2, 2],
[2, 2, 2, 2],
[2, 1, 2, 0],
[2, 2, 1, 2],
[2, 2, 2, 2],
[1, 2, 2, 2],
[2, 2, 2, 2],
[1, 2, 2, 1],
[0, 2, 1, 2],
[2, 0, 2, 2],
[2, 0, 2, 2],
[2, 1, 2, 2],
[2, 2, 2, 2],
[2, 2, 1, 2],
[2, 2, 1, 0],
[2, 2, 2, 2],
[2, 2, 2, 2],
[0, 2, 2, 2]], dtype=uint8), 8, 1, 0.3333333333333333, -1),
)
You can reproduce this example by temporarily adding @reproduce_failure('6.47.0', b'AXicE+LmYmIsK1Bh5GIUY2RhtGbiY3Rm4GI0ZeRlrNdiZgZSTozsjIYMPIxJfsxMTIw+DGyM6iBhO0ZWxkwLoIguAyejIxM3oyYQG4OU5CTJAE1KY2RgZSwTBhKCQJWyQPEsDQYGZiZ2BmYeRoYDUqvAAADIWA5T') as a decorator on your test case
My suggestion would be to add that step to the test.
I'm not sure how to do this. Do you think that would fix the error above?
I'm not sure how to do this. Do you think that would fix the error above?
Hey @tomwhite sorry for the delay, but I don't think it would. Maybe that one is from a precision difference.
FWIW what I had in mind was to put something like x = x[(np.diff(x, axis=1) != 0).any(axis=1)]
(remove rows/variants with identical values) in the hypothesis function right after: https://github.com/pystatgen/sgkit/blob/b99c708040d8ef647599ac6cc9b43df0ccce5f94/sgkit/tests/test_ld.py#L144
That would be better than doing it in the test since some of the other parameters drawn depend on it. I don't think that will help with that last example, but it should prevent some other impractical cases from being tested.
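For reference, here is the suggested filter applied to a small example (a sketch of the idea, not the actual test change):

```python
import numpy as np

# Drop variants (rows) whose dosage is identical across all samples
# (columns), i.e. non-segregating variants, before generating test cases.
x = np.array(
    [[0, 0, 0, 0],   # constant -> dropped
     [0, 2, 0, 0],   # segregating -> kept
     [2, 2, 2, 2],   # constant -> dropped
     [1, 0, 2, 1]],  # segregating -> kept
    dtype="uint8",
)
x = x[(np.diff(x, axis=1) != 0).any(axis=1)]
print(x)
# [[0 2 0 0]
#  [1 0 2 1]]
```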