Re-enable w7900 CI jobs when the runner is stable again

Open ScottTodd opened this issue 1 year ago • 1 comments

Seeing jobs stall:

  • https://github.com/iree-org/iree/actions/runs/9054348031/job/24874255455#step:7:92
  • https://github.com/iree-org/iree/actions/runs/9054348028/job/24874260894#step:8:28

ScottTodd avatar May 13 '24 16:05 ScottTodd

To improve stability, we can try:

  • updating GPU drivers
  • adding more runners
  • running a generic sanity check (e.g. rocm-smi) before any test actions
  • dumping logs (dmesg) if errors are detected (assuming nothing sensitive is in the logs)
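
The sanity check and log dump suggested above could be sketched roughly as follows. This is only an illustration, not the project's actual CI setup: the helper names (`run_command`, `gpu_preflight`), the timeout, and the log truncation length are hypothetical, and only the `rocm-smi` and `dmesg` commands come from the list above.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical type alias: anything that takes a command and returns its result.
Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def run_command(cmd: Sequence[str], timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Runs cmd and captures its output.

    A missing binary or a hang past timeout_s is reported as a non-zero
    return code instead of raising, so a stalled GPU tool cannot stall the job.
    """
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as e:
        return subprocess.CompletedProcess(cmd, returncode=1, stdout="", stderr=str(e))


def gpu_preflight(run: Runner = run_command) -> bool:
    """Returns True if rocm-smi succeeds; otherwise dumps dmesg and returns False."""
    result = run(["rocm-smi"])
    if result.returncode == 0:
        return True
    print("rocm-smi failed:", (result.stderr or "").strip())
    # Only dump kernel logs on failure, and only the tail, to limit noise.
    # Assumes nothing sensitive is in the logs, as noted in the list above.
    logs = run(["dmesg"])
    print((logs.stdout or "")[-4000:])
    return False
```

A CI step could call `gpu_preflight()` before any test actions and exit with a non-zero status when it returns `False`, so an unhealthy runner fails fast instead of stalling partway through the tests.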

ScottTodd avatar May 13 '24 16:05 ScottTodd

test_amd_w7900 is still disabled: https://github.com/iree-org/iree/blob/258707898ae4a62d53468a51dc9dc44a1a8e22e4/.github/workflows/ci.yml#L432-L470

due to https://github.com/iree-org/iree/actions/runs/9178357378/job/25238436482#step:7:100

  6/266 Test   #54: iree/hal/drivers/hip/dynamic_symbols_test ...........................................................***Failed    2.12 sec
[==========] Running 3 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from DynamicSymbolsTest
[ RUN      ] DynamicSymbolsTest.CreateFromSystemLoader
[       OK ] DynamicSymbolsTest.CreateFromSystemLoader (2096 ms)
[ RUN      ] DynamicSymbolsTest.SearchPathsFail
[       OK ] DynamicSymbolsTest.SearchPathsFail (0 ms)
[----------] 2 tests from DynamicSymbolsTest (2096 ms total)

[----------] 1 test from NCCLDynamicSymbolsTest
[ RUN      ] NCCLDynamicSymbolsTest.CreateFromSystemLoader
iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols_test.cc:92: Failure
Expected equality of these values:
  21803
  nccl_version
    Which is: 21806

[  FAILED  ] NCCLDynamicSymbolsTest.CreateFromSystemLoader (2 ms)
[----------] 1 test from NCCLDynamicSymbolsTest (2 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 2 test suites ran. (2099 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NCCLDynamicSymbolsTest.CreateFromSystemLoader

 1 FAILED TEST

ScottTodd avatar May 30 '24 15:05 ScottTodd

@erman-gurses were you going to help re-enable this with fixes for iree/hal/drivers/hip/dynamic_symbols_test?

ScottTodd avatar Jun 04 '24 18:06 ScottTodd

@ScottTodd Will try to take a look at it this week.

erman-gurses avatar Jun 04 '24 18:06 erman-gurses

cc @antiagainst @nithinsubbiah , codeowners for /runtime/src/iree/hal/drivers/hip/:

https://github.com/iree-org/iree/blob/58feff319e2fd0dff7909358741a26ffa5807823/.github/CODEOWNERS#L83

iree/hal/drivers/hip/dynamic_symbols_test is failing on CI, so the entire job we added to test HIP on w7900s has been disabled for 3 weeks. @erman-gurses may have time to debug since he helped set up the test runner, but these components need a maintainer.

ScottTodd avatar Jun 06 '24 23:06 ScottTodd

You can put me down as the owner; however, I can only start looking at this next week.

nithinsubbiah avatar Jun 07 '24 19:06 nithinsubbiah

This still needs attention. Just got another report of a similar test failure on Discord. Logs: https://github.com/iree-org/iree/actions/runs/9511978721/job/26219408116?pr=17659#step:7:162

This version check https://github.com/iree-org/iree/blob/34282319af42dfbc5bf80c0514b5e8902ed7cb90/runtime/src/iree/hal/drivers/hip/dynamic_symbols_test.cc#L92 is checking equality against this version https://github.com/iree-org/iree/blob/34282319af42dfbc5bf80c0514b5e8902ed7cb90/third_party/rccl/rccl.h#L20 when it should be checking for a minimum version instead.
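
To illustrate the difference, here is a small Python sketch of the comparison logic only. The real test is C++ using GoogleTest, where this would roughly correspond to using `EXPECT_GE` rather than `EXPECT_EQ`; the fix that actually landed may differ. The function names are hypothetical, and the decoding assumes the `major * 10000 + minor * 100 + patch` encoding of NCCL-style version codes. The two values are the ones from the failure log above.

```python
from typing import Tuple


def decode_version_code(code: int) -> Tuple[int, int, int]:
    """Splits a version code such as 21803 into (major, minor, patch).

    Assumes the major * 10000 + minor * 100 + patch encoding.
    """
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def meets_minimum(loaded: int, minimum: int) -> bool:
    """Accepts any loaded library at least as new as the minimum version."""
    return loaded >= minimum


HEADER_VERSION = 21803  # value the test expected (from the log above)
LOADED_VERSION = 21806  # value reported by the library on the runner

# An equality check rejects a newer patch release of the same library...
print(decode_version_code(HEADER_VERSION), decode_version_code(LOADED_VERSION))
print("equal:", LOADED_VERSION == HEADER_VERSION)
# ...while a minimum-version check accepts it.
print("meets minimum:", meets_minimum(LOADED_VERSION, HEADER_VERSION))
```

With a minimum-version check, updating the library on the runner to a newer patch release no longer breaks the test, while a library older than the vendored header is still rejected.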

cc @sogartar

ScottTodd avatar Jun 14 '24 14:06 ScottTodd

@ScottTodd, there is a fix for it in this PR #17433 but it has been blocked because of docker image problems for a while now. I will open a PR with only this fix.

sogartar avatar Jun 17 '24 17:06 sogartar

> @ScottTodd, there is a fix for it in this PR #17433 but it has been blocked because of docker image problems for a while now. I will open a PR with only this fix.

Already done: https://github.com/iree-org/iree/pull/17674

Please keep PRs focused on a single issue so combined PRs don't end up sitting for long periods.

ScottTodd avatar Jun 17 '24 17:06 ScottTodd