
[BUG] [Java] `CagraRandomizedIT.testResultsTopKWithRandomValues` fails randomly

Open mythrocks opened this issue 4 months ago • 8 comments

CagraRandomizedIT.testResultsTopKWithRandomValues fails occasionally, causing CI pipelines to fail randomly.

This was seen on the second-last CI run on #1366. I was able to repro this locally:

[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.481 s <<< FAILURE! -- in com.nvidia.cuvs.CagraRandomizedIT
[ERROR] com.nvidia.cuvs.CagraRandomizedIT.testResultsTopKWithRandomValues -- Time elapsed: 3.479 s <<< FAILURE!
java.lang.AssertionError: Not found in expected list: 448
        at __randomizedtesting.SeedInfo.seed([DD841DFC69E8C00:24F49D4798CD7240]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at [email protected]/com.nvidia.cuvs.CuVSTestCase.compareResults(CuVSTestCase.java:104)
        at [email protected]/com.nvidia.cuvs.CagraRandomizedIT.tmpResultsTopKWithRandomValues(CagraRandomizedIT.java:170)
        at [email protected]/com.nvidia.cuvs.CagraRandomizedIT.testResultsTopKWithRandomValues(CagraRandomizedIT.java:44)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1763)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$2.evaluate(ThreadLeakControl.java:426)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:716)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.access$200(RandomizedRunner.java:138)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:637)

It would be good to get to the bottom of this. To avoid further churn in CI, it might be good to temporarily back this test out.

mythrocks avatar Oct 28 '25 03:10 mythrocks

@mythrocks is this a duplicate of https://github.com/rapidsai/cuvs/issues/1387, or a new/different issue?

ldematte avatar Oct 28 '25 09:10 ldematte

Ah, there it is. Sorry, yes. This is a dupe of #1387.

On the bright side, we have 2 seeds for failure repro.

mythrocks avatar Oct 28 '25 17:10 mythrocks

It is extremely annoying that neither of the failure seeds reproduces the failure on my workstation. Both were failing reliably late last night.

mythrocks avatar Oct 28 '25 22:10 mythrocks

I can reproduce it with the seed; this week is packed, but I'll get around to investigating this as soon as I can.

ldematte avatar Oct 29 '25 08:10 ldematte

@mythrocks I started investigating this. First of all, it fails very rarely: the test generates random datasets and executes random queries with 3 different memory types, 100 times each (300 runs in total per test execution). The failure is that the results found by brute force differ from the ones returned by CAGRA. It is not a problem with the memory type: it fails with any of them (Heap, Native, Device). For the failures I found, I dumped the matrix content and it stays the same, so there is no copy error or anything like that.
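For readers unfamiliar with the comparison, here is a minimal sketch of what an exact brute-force top-k reference could look like. It is not the code from `CuVSTestCase`; the class name `BruteForceTopK`, the squared Euclidean metric, and the tie-breaking by id are all assumptions made for illustration, since the thread does not state them.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of cuvs: exact nearest neighbours by exhaustive scan.
final class BruteForceTopK {
    private BruteForceTopK() {}

    // Squared Euclidean distance (assumed metric; the thread does not say which one the test uses).
    static float squaredL2(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the ids of the k closest dataset rows, nearest first.
    // Ties are broken by the lower id so the result is deterministic.
    static int[] topK(float[][] dataset, float[] query, int k) {
        Integer[] ids = new Integer[dataset.length];
        float[] dist = new float[dataset.length];
        for (int i = 0; i < dataset.length; i++) {
            ids[i] = i;
            dist[i] = squaredL2(dataset[i], query);
        }
        Arrays.sort(ids, Comparator.<Integer>comparingDouble(i -> dist[i])
                                   .thenComparingInt(i -> i));
        int n = Math.min(k, ids.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = ids[i];
        }
        return out;
    }
}
```

An exact scan like this always returns the true nearest ids, whereas a graph-based index only approximates them, so an id-by-id comparison between the two can fail without either side being wrong.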

I didn't write this test, so I'm not sure where the problem is. The brute-force looks reasonable to me. It might be a cuvs bug, or simply differences in the way CAGRA computes the graph leading to different results. I think this test is valuable and we should run it to prove things operate as expected, but I'm not sure how to proceed, especially in the second case: if there is no bug, only differences that can legitimately happen, how do we weed them out while keeping the test relevant?

ldematte avatar Nov 06 '25 17:11 ldematte

Also, I think we should close either this or https://github.com/rapidsai/cuvs/issues/1387. Keeping both open will just add noise.

ldematte avatar Nov 06 '25 17:11 ldematte

I thought I'd already closed #1387 as a dupe of this. Closed now.

We'll carry on here. Thanks for picking this up, @ldematte. I'll assign this to you.

> The brute-force looks reasonable to me. It might be a cuvs bug, or simply differences in the way CAGRA computes the graph leading to different results.

I'll defer to the domain experts for this question: @benfred, @cjnolet.

mythrocks avatar Nov 06 '25 18:11 mythrocks

We chatted about this offline. We should never be expecting or relying upon any type of exactness in approximate algorithms. Instead we should be using a "summary statistic" like recall and verifying that we consistently stay above a particular recall value. @chatman @narangvivek10 can you take a look at this?
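As an illustration of the recall-based check described above, here is a minimal sketch. The class `RecallCheck`, its method names, and the threshold used in the usage note are hypothetical and not part of the cuvs test suite; it only assumes that the expected (brute-force) and actual (CAGRA) results are available as arrays of neighbour ids per query.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of cuvs: compares approximate results to exact ones via recall.
final class RecallCheck {
    private RecallCheck() {}

    // Fraction of the expected (exact) ids that also appear in the actual (approximate) ids.
    // Order is ignored; duplicate ids in 'actual' are counted once.
    static double recall(int[] expected, int[] actual) {
        Set<Integer> expectedIds = new HashSet<>();
        for (int id : expected) {
            expectedIds.add(id);
        }
        if (expectedIds.isEmpty()) {
            return 1.0; // nothing to find, so nothing was missed
        }
        Set<Integer> counted = new HashSet<>();
        int hits = 0;
        for (int id : actual) {
            if (expectedIds.contains(id) && counted.add(id)) {
                hits++;
            }
        }
        return (double) hits / expectedIds.size();
    }

    // Mean recall over all queries; both arrays hold one row of ids per query.
    static double meanRecall(int[][] expected, int[][] actual) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("query count mismatch");
        }
        if (expected.length == 0) {
            return 1.0;
        }
        double sum = 0.0;
        for (int q = 0; q < expected.length; q++) {
            sum += recall(expected[q], actual[q]);
        }
        return sum / expected.length;
    }

    // Fails only when the aggregate recall drops below the chosen threshold.
    static void assertRecallAtLeast(double minRecall, int[][] expected, int[][] actual) {
        double r = meanRecall(expected, actual);
        if (r < minRecall) {
            throw new AssertionError("Recall " + r + " is below the threshold " + minRecall);
        }
    }
}
```

A test could then call something like `RecallCheck.assertRecallAtLeast(0.9, expectedIds, cagraIds)` instead of requiring every returned id to be in the expected list. The value 0.9 is only a placeholder; a suitable threshold would have to be chosen for the dataset sizes and index parameters the test generates.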

cjnolet avatar Nov 14 '25 16:11 cjnolet