Scott Todd

400 comments by Scott Todd

I should take over this PR and get it landed. Currently debugging issues where a model we're bringing up hits asserts during compilation (we had been testing with asserts _disabled_...

FWIW, I tried to reproduce this on my machine (NVIDIA 2080 Ti GPU) both without `--iree-vulkan-target-triple` and with `--iree-vulkan-target-triple=turing-unknown-windows`. Both of those failed to compile, making this tricky to help with...

To improve stability, we can try

* updating GPU drivers
* adding more runners
* running a generic sanity check (e.g. `rocm-smi`) before any test actions
* dumping logs (dmesg)...
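The "generic sanity check before any test actions" idea could look roughly like the sketch below. It is only an illustration: the helper name `run_sanity_check`, the `CheckResult` type, and the 30-second timeout are assumptions, not part of the IREE CI. It runs a command such as `rocm-smi` with no arguments and reports failure if the tool is missing, hangs, or exits with a non-zero code.

```python
import shutil
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool      # True if the command ran and exited with code 0
    detail: str   # stdout on success, a short failure description otherwise


def run_sanity_check(command, timeout_s=30.0):
    """Run a quick health-check command (hypothetical helper).

    `command` is an argv-style list, e.g. ["rocm-smi"].
    """
    tool = command[0]
    # Fail early with a clear message if the tool is not installed.
    if shutil.which(tool) is None:
        return CheckResult(False, f"{tool} not found on PATH")
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung health check is itself a sign the runner is unhealthy.
        return CheckResult(False, f"{tool} timed out after {timeout_s}s")
    if proc.returncode != 0:
        return CheckResult(
            False, f"{tool} exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    return CheckResult(True, proc.stdout.strip())
```

A CI step could call `run_sanity_check(["rocm-smi"])` and stop the job when `ok` is false, so that an unhealthy runner is reported as an infrastructure problem rather than as a test failure.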

`test_amd_w7900` is still disabled: https://github.com/iree-org/iree/blob/258707898ae4a62d53468a51dc9dc44a1a8e22e4/.github/workflows/ci.yml#L432-L470

due to https://github.com/iree-org/iree/actions/runs/9178357378/job/25238436482#step:7:100

```
6/266 Test #54: iree/hal/drivers/hip/dynamic_symbols_test ...........................................................***Failed 2.12 sec
[==========] Running 3 tests from 2 test suites.
[----------] Global test environment set-up.
[----------]...
```

@erman-gurses were you going to help re-enable this with fixes for `iree/hal/drivers/hip/dynamic_symbols_test`?

cc @antiagainst @nithinsubbiah, codeowners for `/runtime/src/iree/hal/drivers/hip/`: https://github.com/iree-org/iree/blob/58feff319e2fd0dff7909358741a26ffa5807823/.github/CODEOWNERS#L83

`iree/hal/drivers/hip/dynamic_symbols_test` is failing on CI so the entire job we added to test hip on w7900s has been disabled for 3 weeks....

This still needs attention. Just got another report of a similar test failure [on Discord here](https://discord.com/channels/689900678990135345/689957613152239638/1251073758035185685).

Logs: https://github.com/iree-org/iree/actions/runs/9511978721/job/26219408116?pr=17659#step:7:162

This version check
https://github.com/iree-org/iree/blob/34282319af42dfbc5bf80c0514b5e8902ed7cb90/runtime/src/iree/hal/drivers/hip/dynamic_symbols_test.cc#L92
is checking equality against this version
https://github.com/iree-org/iree/blob/34282319af42dfbc5bf80c0514b5e8902ed7cb90/third_party/rccl/rccl.h#L20
when...

> @ScottTodd, there is a fix for it in this PR #17433 but it has been blocked because of docker image problems for a while now. I will open a...

* I suspect the Vulkan `failed to legalize operation 'arith.fptosi'` error is in upstream MLIR SPIRV (missing lowering)
* Numerical errors in tests could be issues in the torch-mlir lowerings...

We have a separate rotation for updating torch-mlir (in fact, @AmosLewis is up for next week 🤔). They are usually updated separately but needed to be updated together in this...