Chris Erb
@atamazov this test docker was on rocm 6.1.0-82
@atamazov
```
version: 5.18.13
srcversion: 7D4E7C8EA7D467BB8AED6A1
vermagic: 5.15.0-105-generic SMP mod_unload modversions
```
Perhaps that means updating the base driver on this machine could resolve this issue?
However, the base version on our CI nodes is 6.2.4, and I believe we observe the same issue there.
> @cderb it seems that somehow #2870 is not effective in this PR's CI? I'll clean up here.

Starting debug.
@junliume It appears there are additional tests, beyond those addressed in #2870, which hang when xnack is set manually in the environment via HSA_XNACK=1.
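
For anyone trying to reproduce this, here is a minimal sketch of how a single test could be run with `HSA_XNACK=1` under a timeout, so that a hang is reported instead of blocking the whole run. The helper name `run_with_xnack` and the 600-second default are my own assumptions for illustration, not something taken from our CI scripts.

```python
import os
import subprocess


def run_with_xnack(cmd, timeout_s=600, xnack="1"):
    """Run `cmd` with HSA_XNACK set in its environment.

    Returns "pass" (exit code 0), "fail" (non-zero exit code), or
    "hang" (the command did not finish within `timeout_s` seconds).
    Hypothetical helper; name and defaults are assumptions.
    """
    # Copy the current environment and override only HSA_XNACK.
    env = dict(os.environ, HSA_XNACK=xnack)
    try:
        result = subprocess.run(
            cmd,
            env=env,
            timeout=timeout_s,  # the child is killed if this expires
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if result.returncode == 0 else "fail"
```

Calling it with the argument list for one test binary (whichever one is under suspicion) would separate tests that hang under xnack from tests that simply fail.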
> > @junliume It appears there are additional tests above what was addressed in #2870 which will hang when xnack is set manually in the environment by HSA_XNACK=1. > >...
The issue appears to be MI200-specific. The same tests are passing on MI300.
The cause appears to be that the GPU is asleep during the copy and not waking back up when it should. Changing the grub options allowed these tests to pass...
xnack+ `make check` is now passing on the machine with the modified grub options.
Shook out the errors on the CI machine. MI300 CI enablement is limited to the dbsync test because of limited hardware.