Flaky MPI E2E test

Open JesseStutler opened this issue 5 months ago • 2 comments

Description

The MPI E2E test is flaky and fails occasionally in CI. This issue is to track and fix the flakiness.

A link to the failed job can be found here: https://github.com/volcano-sh/volcano/actions/runs/19453053500/job/55661658286

Relevant logs from the failure:

2025-11-18T04:02:18.8044939Z • [FAILED] [613.831 seconds]
2025-11-18T04:02:18.8045405Z MPI E2E Test [It] will run and complete finally
...
2025-11-18T04:02:18.8046538Z   [FAILED] Unexpected error:
2025-11-18T04:02:18.8046958Z       <*errors.errorString | 0xc0002620d0>:
2025-11-18T04:02:18.8047666Z       [Wait time out]: expected job 'mpi' to be in status Running, actual get Pending

Steps to reproduce the issue

  1. Run the E2E tests in the volcano-sh/volcano repository.
  2. To target only the failing test, run the following ginkgo command from the test/e2e/jobseq directory (the focus string matches the spec name shown in the log above):
     ginkgo -v --focus="MPI E2E Test"

Describe the results you received and expected

Expected result: The MPI E2E test should pass consistently.

Actual result: The test fails intermittently. The job is expected to reach the 'Running' state, but it stays stuck in 'Pending' until the test times out.

What version of Volcano are you using?

master

Any other relevant information

No response

JesseStutler, Nov 18 '25 06:11

Hi @JesseStutler, can I work on this issue?

karankoder, Dec 07 '25 19:12

/assign

karankoder, Dec 07 '25 19:12