Flaky MPI E2E test
Description
The MPI E2E test is flaky and fails occasionally in CI. This issue is to track and fix the flakiness.
A link to the failed job can be found here: https://github.com/volcano-sh/volcano/actions/runs/19453053500/job/55661658286
Relevant logs from the failure:
2025-11-18T04:02:18.8044939Z • [FAILED] [613.831 seconds]
2025-11-18T04:02:18.8045405Z MPI E2E Test [It] will run and complete finally
...
2025-11-18T04:02:18.8046538Z [FAILED] Unexpected error:
2025-11-18T04:02:18.8046958Z <*errors.errorString | 0xc0002620d0>:
2025-11-18T04:02:18.8047666Z [Wait time out]: expected job 'mpi' to be in status Running, actual get Pending
Steps to reproduce the issue
- Run the E2E tests in the volcano-sh/volcano repository.
- To specifically target the failing test, you can use the following ginkgo command from the test/e2e/jobseq directory:
ginkgo -v --focus="MPI Plugin E2E Test"
Describe the results you received and expected
Expected result: The MPI E2E test should pass consistently.
Actual result: The test fails intermittently. The job is expected to be in the 'Running' state, but it gets stuck in 'Pending' until the test times out.
What version of Volcano are you using?
master
Any other relevant information
No response
Hi @JesseStutler, can I work on this issue?
/assign