ROCR-Runtime

ROCm 4.5.x error in DiscoverGpu function?

Open • tcojean opened this issue 3 years ago • 1 comment

Hello,

Since ROCm 4.5.0, we have been seeing errors on our platform, which uses MI100 GPUs. All traces point to DiscoverGpu as the root of the problem. There does appear to be an error in the code, but I may be wrong.

https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_topology.cpp#L124-L149

In our case (gfx908), we essentially go through this branch: gpu is allocated, then immediately deleted so that it can be recreated. The problem is in the deletion: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_gpu_agent.cpp#L181

In this case, because the GpuAgent was constructed and then immediately deleted, this std::function (as well as other data) was never initialized, so calling it throws std::bad_function_call.
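To illustrate the failure mode outside of ROCR, here is a minimal sketch. `ToyAgent`, `LateInit`, `Release`, and `ReleaseThrows` are hypothetical names made up for this example and are not part of ROCR-Runtime; the only point is that invoking a default-constructed (empty) std::function throws std::bad_function_call.

```cpp
#include <functional>

// Hypothetical stand-in for an agent whose deallocator callback is only
// assigned in a late initialization step (not the real GpuAgent).
struct ToyAgent {
  std::function<void(void*)> deallocator;  // empty until LateInit() runs
  int released = 0;

  // Assumed late initialization step that installs the callback.
  void LateInit() {
    deallocator = [this](void*) { ++released; };
  }

  // Cleanup that calls the stored callback unconditionally.
  void Release(void* ptr) { deallocator(ptr); }
};

// Returns true if Release() throws std::bad_function_call.
bool ReleaseThrows(bool run_late_init) {
  ToyAgent agent;
  if (run_late_init) agent.LateInit();
  try {
    agent.Release(nullptr);
  } catch (const std::bad_function_call&) {
    return true;
  }
  return false;
}
```

Calling `ReleaseThrows(false)` exercises the "constructed, then immediately cleaned up" path. Note that the sketch uses a regular member function for cleanup; if the same call sits in a destructor, which is implicitly noexcept, the exception cannot propagate and std::terminate is called instead.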

As far as I can see, this std::function for deallocating is initialized in GpuAgent::InitNumaAllocator https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_gpu_agent.cpp#L1572-L1602. That function is only called from GpuAgent::PostToolsInit https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/amd_gpu_agent.cpp#L687-L695, which in turn is only called from Runtime::Load https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.5.x/src/core/runtime/runtime.cpp#L1369-L1376.
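For illustration only, one defensive pattern for this kind of two-phase initialization is to check the callback before using it in the destructor. This is a sketch under my own assumptions, not the project's actual code or fix: `GuardedAgent`, `PostInit`, and `ReleasesAfterLifecycle` are hypothetical names.

```cpp
#include <functional>

// Hypothetical agent with two-phase initialization: the constructor leaves
// the deallocator empty, and PostInit() installs it later.
struct GuardedAgent {
  std::function<void(void*)> deallocator;
  int* release_count;  // counts how often the deallocator actually ran

  explicit GuardedAgent(int* counter) : release_count(counter) {}

  // The callback captures `this`, so copying is disabled to keep it valid.
  GuardedAgent(const GuardedAgent&) = delete;
  GuardedAgent& operator=(const GuardedAgent&) = delete;

  // Assumed late initialization step.
  void PostInit() {
    deallocator = [this](void*) { ++*release_count; };
  }

  ~GuardedAgent() {
    // std::function converts to false when empty, so an agent destroyed
    // before PostInit() skips the call instead of throwing.
    if (deallocator) deallocator(nullptr);
  }
};

// Runs one construct/destroy cycle and reports how many releases happened.
int ReleasesAfterLifecycle(bool run_post_init) {
  int count = 0;
  {
    GuardedAgent agent(&count);
    if (run_post_init) agent.PostInit();
  }  // destructor runs here
  return count;
}
```

With this guard, an agent that is destroyed before its late initialization simply has nothing to release. Whether that is the right behavior for GpuAgent depends on what its destructor needs to free, which I have not verified.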

tcojean • Feb 21 '22 12:02