Re-enable w7900 CI jobs when the runner is stable again
Seeing jobs stall:
- https://github.com/iree-org/iree/actions/runs/9054348031/job/24874255455#step:7:92
- https://github.com/iree-org/iree/actions/runs/9054348028/job/24874260894#step:8:28
To improve stability, we can try:
- updating GPU drivers
- adding more runners
- running a generic sanity check (e.g. rocm-smi) before any test actions
- dumping logs (dmesg) if errors are detected (assuming nothing sensitive is in the logs)
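The last two ideas could be combined into one pre-test step. Below is a minimal sketch, not something that exists in the IREE repo: the helper names `run_command` and `gpu_sanity_check` are hypothetical, and it assumes `rocm-smi` and `dmesg` are on the runner's PATH and that the CI user is allowed to read the kernel log.

```python
import subprocess
import sys


def run_command(cmd, timeout_s=60):
    """Run cmd and return (succeeded, combined stdout/stderr) without raising."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired) as e:
        # Covers a missing binary as well as a command that hangs.
        return False, f"{type(e).__name__}: {e}"
    return result.returncode == 0, result.stdout


def gpu_sanity_check(check_cmd=("rocm-smi",), log_cmd=("dmesg",)):
    """Return True if check_cmd succeeds; otherwise dump log_cmd output to stderr."""
    ok, output = run_command(list(check_cmd))
    if ok:
        return True
    print(f"Sanity check {check_cmd} failed:\n{output}", file=sys.stderr)
    # Only dump logs on failure, and only if nothing sensitive is in them.
    _, logs = run_command(list(log_cmd))
    print(f"Output of {log_cmd}:\n{logs}", file=sys.stderr)
    return False
```

A workflow step could call `gpu_sanity_check()` before the test actions and exit non-zero when it returns `False`, so a bad runner fails fast instead of stalling mid-test. The timeout matters here because the failures above were stalls rather than crashes.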
test_amd_w7900 is still disabled:
https://github.com/iree-org/iree/blob/258707898ae4a62d53468a51dc9dc44a1a8e22e4/.github/workflows/ci.yml#L432-L470
due to https://github.com/iree-org/iree/actions/runs/9178357378/job/25238436482#step:7:100
6/266 Test #54: iree/hal/drivers/hip/dynamic_symbols_test ...........................................................***Failed 2.12 sec
[==========] Running 3 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from DynamicSymbolsTest
[ RUN ] DynamicSymbolsTest.CreateFromSystemLoader
[ OK ] DynamicSymbolsTest.CreateFromSystemLoader (2096 ms)
[ RUN ] DynamicSymbolsTest.SearchPathsFail
[ OK ] DynamicSymbolsTest.SearchPathsFail (0 ms)
[----------] 2 tests from DynamicSymbolsTest (2096 ms total)
[----------] 1 test from NCCLDynamicSymbolsTest
[ RUN ] NCCLDynamicSymbolsTest.CreateFromSystemLoader
iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols_test.cc:92: Failure
Expected equality of these values:
21803
nccl_version
Which is: 21806
[ FAILED ] NCCLDynamicSymbolsTest.CreateFromSystemLoader (2 ms)
[----------] 1 test from NCCLDynamicSymbolsTest (2 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 2 test suites ran. (2099 ms total)
[ PASSED ] 2 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NCCLDynamicSymbolsTest.CreateFromSystemLoader
1 FAILED TEST
@erman-gurses were you going to help re-enable this with fixes for iree/hal/drivers/hip/dynamic_symbols_test?
@ScottTodd Will try to take a look at it this week.
cc @antiagainst @nithinsubbiah, codeowners for /runtime/src/iree/hal/drivers/hip/:
https://github.com/iree-org/iree/blob/58feff319e2fd0dff7909358741a26ffa5807823/.github/CODEOWNERS#L83
iree/hal/drivers/hip/dynamic_symbols_test is failing on CI so the entire job we added to test hip on w7900s has been disabled for 3 weeks. @erman-gurses may have time to debug since he helped set up the test runner, but these components need a maintainer.
You can put me down as the owner; however, I can only start looking at this next week.
This still needs attention. Just got another report of a similar test failure on Discord. Logs: https://github.com/iree-org/iree/actions/runs/9511978721/job/26219408116?pr=17659#step:7:162
This version check: https://github.com/iree-org/iree/blob/34282319af42dfbc5bf80c0514b5e8902ed7cb90/runtime/src/iree/hal/drivers/hip/dynamic_symbols_test.cc#L92
is checking for equality against this version: https://github.com/iree-org/iree/blob/34282319af42dfbc5bf80c0514b5e8902ed7cb90/third_party/rccl/rccl.h#L20
when it should be checking for a minimum version instead.
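To illustrate the difference, here is a small sketch using the two version codes from the failing log (21803 expected, 21806 loaded on the runner). It is only an illustration of the comparison in Python, not the actual C++ test code or the actual fix; the function names are hypothetical. In GoogleTest terms this would roughly correspond to replacing an `EXPECT_EQ` with a greater-or-equal comparison such as `EXPECT_GE`.

```python
# Version code the test expects, taken from the failing log.
EXPECTED_NCCL_VERSION = 21803
# Version code reported by the library on the runner, taken from the same log.
RUNNER_NCCL_VERSION = 21806


def passes_exact_check(loaded, expected=EXPECTED_NCCL_VERSION):
    """Current behavior: only the exact version the test was written for passes."""
    return loaded == expected


def passes_minimum_check(loaded, minimum=EXPECTED_NCCL_VERSION):
    """Proposed behavior: any version at or above the minimum passes."""
    return loaded >= minimum


print(passes_exact_check(RUNNER_NCCL_VERSION))    # False: 21806 != 21803
print(passes_minimum_check(RUNNER_NCCL_VERSION))  # True: 21806 >= 21803
```

With a minimum check, a newer library on the runner no longer fails the test, while a runner with an older library is still caught.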
cc @sogartar
@ScottTodd, there is a fix for it in PR #17433, but that PR has been blocked by Docker image problems for a while now. I will open a PR with only this fix.
Already done: https://github.com/iree-org/iree/pull/17674
Please keep PRs focused on a single issue so combined PRs don't end up sitting for long periods.