sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

LD prune test is failing due to differences with scikit-allel

Open tomwhite opened this issue 2 years ago • 4 comments

Hypothesis is finding cases where sgkit and scikit-allel differ.

For example:

=================================== FAILURES ===================================
_______________________________ test_vs_skallel ________________________________
    @given(args=ld_prune_args())  # pylint: disable=no-value-for-parameter
>   @settings(max_examples=50, deadline=None, phases=PHASES_NO_SHRINK)
sgkit/tests/test_ld.py:158: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
args = (array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0... 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=uint8), 27, 2, 0.0, 19)
    @given(args=ld_prune_args())  # pylint: disable=no-value-for-parameter
    @settings(max_examples=50, deadline=None, phases=PHASES_NO_SHRINK)
    @example(args=(np.array([[1, 1], [1, 1]], dtype="uint8"), 1, 1, 0.0, -1))
    def test_vs_skallel(args):
        x, size, step, threshold, chunks = args
        ds = simulate_genotype_call_dataset(n_variant=x.shape[0], n_sample=x.shape[1])
        ds["dosage"] = (["variants", "samples"], da.asarray(x).rechunk({0: chunks}))
        ds = window_by_variant(ds, size=size, step=step)
        ldm = ld_matrix(ds, threshold=threshold)
        has_duplicates = ldm.compute().duplicated(subset=["i", "j"]).any()
        assert not has_duplicates
        idx_drop_ds = maximal_independent_set(ldm)
        idx_drop = np.sort(idx_drop_ds.ld_prune_index_to_drop.data)
        m = allel.locate_unlinked(x, size=size, step=step, threshold=threshold)
        idx_drop_ska = np.sort(np.argwhere(~m).squeeze(axis=1))
>       npt.assert_equal(idx_drop_ska, idx_drop)
E       AssertionError: 
E       Arrays are not equal
E       
E       (shapes (0,), (74,) mismatch)
E        x: array([], dtype=int64)
E        y: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
E              18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35,
E              36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,...

sgkit/tests/test_ld.py:176: AssertionError
---------------------------------- Hypothesis ----------------------------------
Falsifying example: test_vs_skallel(
    args=(array([[0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 2, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]], dtype=uint8), 27, 2, 0.0, 19),
)
You can reproduce this example by temporarily adding @reproduce_failure('6.47.0', b'AEsCAQEBAgADABoBAQMBAAAAAAAAAAAACQ==') as a decorator on your test case

From https://github.com/pystatgen/sgkit/runs/6813432367?check_suite_focus=true#step:8:1124.

At a cursory glance this looks like it could be due to a problem with precision, which we've had with Rogers Huff calculations before. I'm not sure why this has started happening now though - perhaps a more recent Hypothesis version has a different search strategy?

tomwhite avatar Jun 09 '22 16:06 tomwhite

Hypothesis is a bit of a mixed blessing in my experience - it does find important corner cases sometimes, but you spend a lot of time figuring out how to avoid unimportant cases that'll just never happen.

I'd be happy to skip this test for now, and log the hypothesis thing as an issue.

jeromekelleher avatar Jun 10 '22 07:06 jeromekelleher

With a threshold of 0 and all variants but one having the same genotype across all samples, I think sgkit is doing the correct thing here while scikit-allel is not. I'm pretty sure the difference is a result of this discrepancy:

allel/opt/stats.pyx#L68-L70

    # early out
    if n == 0 or v0 == 0 or v1 == 0:
        return nan32

https://github.com/pystatgen/sgkit/blob/b99c708040d8ef647599ac6cc9b43df0ccce5f94/sgkit/stats/ld.py#L48-L50

Because of this, skallel is considering the LD to be NaN for each variant pair rather than 1 for all pairs except those involving the sole segregating variant in this test case. This results in skallel eliminating none of the variants while sgkit is eliminating all but 3.

avoid unimportant cases that'll just never happen

I think that's mostly true here since the elimination of non-segregating variants is likely to have happened already, or at least it should have in a real-life workflow. My suggestion would be to add that step to the test.

eric-czech avatar Jun 10 '22 08:06 eric-czech

Thanks @eric-czech! I think that explains the error I posted. I tried setting the range of threshold to exclude 0, but I get other failures.

diff --git a/sgkit/tests/test_ld.py b/sgkit/tests/test_ld.py
index e4bce760..f99bac77 100644
--- a/sgkit/tests/test_ld.py
+++ b/sgkit/tests/test_ld.py
@@ -143,7 +143,7 @@ def ld_prune_args(draw):
     assert x.ndim == 2
     window = draw(st.integers(1, x.shape[0]))
     step = draw(st.integers(1, window))
-    threshold = draw(st.floats(0, 1))
+    threshold = draw(st.floats(1e-2, 1))
     chunks = draw(st.integers(10, window * 3)) if window > 10 else -1
     return x, window, step, threshold, chunks

For example:

>       npt.assert_equal(idx_drop_ska, idx_drop)
E       AssertionError: 
E       Arrays are not equal
E       
E       (shapes (3,), (4,) mismatch)
E        x: array([10, 12, 13])
E        y: array([ 9, 10, 12, 13])

sgkit/tests/test_ld.py:176: AssertionError
--------------------------------------------------------------------------- Hypothesis ----------------------------------------------------------------------------
Falsifying example: test_vs_skallel(
    args=(array([[2, 0, 2, 2],
            [2, 2, 2, 2],
            [2, 2, 2, 2],
            [2, 2, 2, 2],
            [2, 1, 2, 0],
            [2, 2, 1, 2],
            [2, 2, 2, 2],
            [1, 2, 2, 2],
            [2, 2, 2, 2],
            [1, 2, 2, 1],
            [0, 2, 1, 2],
            [2, 0, 2, 2],
            [2, 0, 2, 2],
            [2, 1, 2, 2],
            [2, 2, 2, 2],
            [2, 2, 1, 2],
            [2, 2, 1, 0],
            [2, 2, 2, 2],
            [2, 2, 2, 2],
            [0, 2, 2, 2]], dtype=uint8), 8, 1, 0.3333333333333333, -1),
)

You can reproduce this example by temporarily adding @reproduce_failure('6.47.0', b'AXicE+LmYmIsK1Bh5GIUY2RhtGbiY3Rm4GI0ZeRlrNdiZgZSTozsjIYMPIxJfsxMTIw+DGyM6iBhO0ZWxkwLoIguAyejIxM3oyYQG4OU5CTJAE1KY2RgZSwTBhKCQJWyQPEsDQYGZiZ2BmYeRoYDUqvAAADIWA5T') as a decorator on your test case

My suggestion would be to add that step to the test.

I'm not sure how to do this. Do you think that would fix the error above?

tomwhite avatar Jun 10 '22 10:06 tomwhite

I'm not sure how to do this. Do you think that would fix the error above?

Hey @tomwhite sorry for the delay, but I don't think it would. Maybe that one is from a precision difference.

FWIW what I had in mind was to put something like x = x[(np.diff(x, axis=1) != 0).any(axis=1)] (remove rows/variants with identical values) in the hypothesis function right after: https://github.com/pystatgen/sgkit/blob/b99c708040d8ef647599ac6cc9b43df0ccce5f94/sgkit/tests/test_ld.py#L144

That would be better than doing it in the test since some of the other parameters drawn depend on it. I don't think that will help with that last example, but it should prevent some other impractical cases from being tested.

eric-czech avatar Jun 16 '22 18:06 eric-czech