ROCR-Runtime
Only the first GPU agent exposes device-local memory regions on a multi-GPU node (ROCr 4.3.x)
We are trying to use ROCm in a multi-GPU setup with several discrete GPUs per node, and we want to allocate device-local memory on several GPUs at the same time. However, we noticed that only the first visible GPU (as specified by the ROCR_VISIBLE_DEVICES environment variable) exposes a memory region that is not host-accessible, is coarse-grained, and has a size matching the device's VRAM.
If a system features multiple discrete GPUs, only the first GPU exposes memory regions associated with the device-local memory.
We use the hsa_agent_iterate_regions function to get the list of memory regions available to an agent. This function uses the VisitRegion method internally. We believe that the following check leads to the bug:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/c5f95f9b33af2aa1dd1e6ba76b18cd2e291f3c7d/src/core/runtime/amd_gpu_agent.cpp#L480-L485
The agent's local memory regions are visited only when this->node_id() == core::Runtime::runtime_singleton_->region_gpu()->node_id(), and this is true only for the first GPU discovered, according to the DiscoverGPU method:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/c5f95f9b33af2aa1dd1e6ba76b18cd2e291f3c7d/src/core/runtime/runtime.cpp#L207
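To make the reasoning above concrete, here is a toy model of the guarded visit. It is not the ROCr source: ModelAgent, ModelRuntime and visit_local_regions are hypothetical stand-ins, and the only behaviour modelled is the one described above (region_gpu() refers to the first GPU discovered, and local regions are reported only when the node IDs match).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a GPU agent; not a real ROCr type.
struct ModelAgent {
  uint32_t node_id;
  std::vector<int> local_regions;  // IDs of this agent's device-local regions
};

// Hypothetical stand-in for the runtime singleton.
struct ModelRuntime {
  std::vector<ModelAgent> gpus;  // in discovery order

  // Models the role described above: the first GPU discovered.
  const ModelAgent* region_gpu() const {
    return gpus.empty() ? nullptr : &gpus.front();
  }
};

// Models the guarded visit. With apply_check == true, local regions are
// reported only when the agent's node ID equals the region GPU's node ID.
// With apply_check == false, the guard is skipped (the patched behaviour).
std::vector<int> visit_local_regions(const ModelRuntime& runtime,
                                     const ModelAgent& agent,
                                     bool apply_check) {
  const ModelAgent* first = runtime.region_gpu();
  if (apply_check && (first == nullptr || agent.node_id != first->node_id)) {
    return {};  // every GPU except the first reports no local regions
  }
  return agent.local_regions;
}
```

In this model, a runtime with two GPUs returns the local regions of the first GPU only; the second GPU's list comes back empty unless apply_check is false.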
This bug is specific to the HSA regions API; the AMD memory pools extension API shows the correct set of memory regions (global coarse-grained and group only) for each device.
I'm using only one GPU per process. Even with a single GPU per process, only the first GPU gets device-local memory; the second one allocates through PCIe. I removed the check at line 480, recompiled libhsa-runtime64.so.1.4.0 using the aomp scripts, ran ldconfig, and the problem was solved. Not clean, but it works.
As an alternative to hsa_agent_iterate_regions(), the AMD extended APIs on HSA for memory pools handle this correctly, with allocation going through hsa_amd_memory_pool_allocate(). As an example, ROCm-Bandwidth-Test, which is open source and available at https://github.com/RadeonOpenCompute/rocm_bandwidth_test, uses the following AMD extended APIs to discover memory pools and allocate from them on a per-device basis, using global flags such as coarse-grained:
--> hsa_amd_agent_iterate_memory_pools
--> hsa_amd_agent_memory_pool_get_info
--> hsa_amd_memory_pool_get_info
--> hsa_amd_memory_pool_allocate
These are declared in hsa_ext_amd.h, which is installed alongside hsa_ext_finalize.h in /opt/rocm/include/hsa.
We can use these AMD extended APIs to build and test the application.
@vchuravy, could you try the AMD extended APIs posted above and test it? @srinivamd