ROCR-Runtime
ROCm 4.5.x error in DiscoverGpu function?
Hello,
Since 4.5.0 we have been seeing errors on our platform, which uses MI100 GPUs. All traces point to DiscoverGpu as the root of the problem. There does appear to be an error in the code, but I may be wrong.
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_topology.cpp#L124-L149
In our case, we pretty much go through this branch (gfx908). `gpu` is allocated, then instantly deleted to be recreated again. The problem is on the deletion:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_gpu_agent.cpp#L181
In this case, because the `GpuAgent` was constructed and then instantly deleted, this `std::function` (as well as other data) never got to be initialized, and therefore calling it throws a `bad_function_call`.
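To illustrate the failure mode described above, here is a minimal standalone sketch that does not use any ROCR-Runtime code. The names `deallocator`, `ThrowsBadFunctionCall`, `EmptyDeallocatorThrows`, and `AssignedDeallocatorThrows` are hypothetical and only stand in for the member discussed in this issue. It shows that invoking a default-constructed `std::function` throws `std::bad_function_call`, while an assigned one does not.

```cpp
#include <cassert>
#include <functional>

// Returns true if invoking `fn` throws std::bad_function_call.
bool ThrowsBadFunctionCall(const std::function<void(void*)>& fn) {
  try {
    fn(nullptr);
  } catch (const std::bad_function_call&) {
    return true;
  }
  return false;
}

// A default-constructed std::function holds no target, so calling it throws.
bool EmptyDeallocatorThrows() {
  std::function<void(void*)> deallocator;  // hypothetical: never assigned
  return ThrowsBadFunctionCall(deallocator);
}

// Once a target is assigned, the same call succeeds.
bool AssignedDeallocatorThrows() {
  std::function<void(void*)> deallocator = [](void*) {};  // no-op target
  return ThrowsBadFunctionCall(deallocator);
}
```

Note that if such a call happens inside a destructor, which is implicitly `noexcept`, the exception cannot propagate and the program terminates instead.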
As far as I can see, this `std::function` for deallocating is initialized in `GpuAgent::InitNumaAllocator`:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_gpu_agent.cpp#L1572-L1602
which is only called in `GpuAgent::PostToolsInit`:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_gpu_agent.cpp#L687-L695
which is itself only called in `Runtime::Load`:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/runtime.cpp#L1369-L1376
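The call chain above amounts to a two-phase initialization: the object is constructed first, and the deallocator is only assigned in a later step. The following toy sketch models that pattern under stated assumptions. `ToyAgent`, `PostInit`, and `ReleaseScratch` are hypothetical names and are not the ROCR-Runtime classes or its actual fix. It shows one possible guard, checking the `std::function` before calling it, so that tearing down an object that never reached the second phase does not throw.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for an agent with two-phase initialization.
class ToyAgent {
 public:
  // Phase 2: assigns the deallocator, analogous to a deferred init step.
  void PostInit() {
    deallocator_ = [this](void*) { ++released_; };
  }

  // Teardown logic a destructor might run. Returns false if it was skipped
  // because the deallocator was never assigned.
  bool ReleaseScratch() {
    if (!deallocator_) return false;  // guard against an empty std::function
    deallocator_(nullptr);
    return true;
  }

  int released() const { return released_; }

 private:
  std::function<void(void*)> deallocator_;  // empty until PostInit() runs
  int released_ = 0;
};

// Teardown without phase 2: the guard skips the call instead of throwing.
bool ReleaseWithoutPostInit() {
  ToyAgent agent;
  return agent.ReleaseScratch();
}

// Teardown after phase 2: the deallocator runs exactly once.
bool ReleaseAfterPostInit() {
  ToyAgent agent;
  agent.PostInit();
  return agent.ReleaseScratch() && agent.released() == 1;
}
```

Whether a guard like this is appropriate in the real destructor, or whether the initialization should instead move into the constructor, is a design decision for the maintainers.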