ROCR-Runtime

Only the first GPU agent exposes device-local memory regions on a multi-GPU node (ROCr 4.3.x)

Open utkinis opened this issue 2 years ago • 5 comments

We are trying to use ROCm in a multi-GPU setup with several discrete GPUs per node, and we want to allocate device-local memory on several GPUs at the same time. However, we noticed that only the first visible GPU (as specified by the ROCR_VISIBLE_DEVICES environment variable) exposes a memory region that is not host-accessible, is coarse-grained, and has a size matching the amount of the device's VRAM.

If a system features multiple discrete GPUs, only the first GPU exposes memory regions associated with the device-local memory. We use the hsa_agent_iterate_regions function to get the list of available memory regions for the agent. This function uses the VisitRegion method internally. We believe that the following check leads to the bug:

https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/c5f95f9b33af2aa1dd1e6ba76b18cd2e291f3c7d/src/core/runtime/amd_gpu_agent.cpp#L480-L485

The local memory regions are accessed only when this->node_id() == core::Runtime::runtime_singleton_->region_gpu()->node_id(), and this is true only for the first GPU discovered, according to the DiscoverGPU method:

https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/c5f95f9b33af2aa1dd1e6ba76b18cd2e291f3c7d/src/core/runtime/runtime.cpp#L207
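
To make the suspected behaviour concrete, here is a minimal, self-contained C++ model of the check as described above. It is not the ROCr source: `Region`, `GpuAgent`, `VisibleRegions`, and `CountDeviceLocal` are hypothetical stand-ins invented for illustration, and the real `VisitRegion` logic is likely more involved. The sketch only assumes that device-local regions are reported when the agent's node id equals that of the first GPU discovered.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the real ROCr types.
struct Region {
    std::string name;
    bool device_local;  // true for a VRAM-backed region
};

struct GpuAgent {
    std::uint32_t node_id;
    std::vector<Region> regions;
};

// Toy model of the check described above: with apply_check set, device-local
// regions are reported only if the agent's node id equals the node id of the
// "region GPU" (assumed here to be the first GPU discovered).
std::vector<Region> VisibleRegions(const GpuAgent& agent,
                                   std::uint32_t region_gpu_node_id,
                                   bool apply_check) {
    std::vector<Region> out;
    for (const Region& r : agent.regions) {
        if (apply_check && r.device_local &&
            agent.node_id != region_gpu_node_id) {
            continue;  // hidden: this agent is not the region GPU
        }
        out.push_back(r);
    }
    return out;
}

// Counts how many of the given regions are device-local.
std::size_t CountDeviceLocal(const std::vector<Region>& regions) {
    std::size_t n = 0;
    for (const Region& r : regions) {
        if (r.device_local) ++n;
    }
    return n;
}
```

In this model, two agents with node ids 1 and 2 and a region GPU with node id 1 give one device-local region for the first agent and none for the second. Passing `apply_check = false` corresponds to dropping the check, in which case both agents report their device-local region.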

utkinis avatar Mar 21 '22 14:03 utkinis

This bug is specific to the HSA regions API; the AMD memory pools extension API shows the correct set of memory regions (for global coarse-grained and group only) per device.

jpsamaroo avatar Mar 28 '22 20:03 jpsamaroo

I'm using only one GPU per process. Even so, only the first GPU gets local RAM; a process using the second GPU allocates through PCIe. I removed the check at line 480, recompiled libhsa-runtime64.so.1.4.0 through the aomp scripts, ran ldconfig, and the problem is solved. Not clean, but it works.

enerc avatar Apr 03 '22 16:04 enerc

As an alternative to hsa_agent_iterate_regions(), the AMD extended APIs on HSA for memory-pool/region allocation, such as hsa_amd_memory_pool_allocate(), handle this correctly. As an example, ROCm-Bandwidth-Test uses the following AMD extended APIs to discover memory pools and allocate from them on a per-device basis, using global flags such as coarse-grained. It is open source and available at: https://github.com/RadeonOpenCompute/rocm_bandwidth_test

 --> hsa_amd_agent_iterate_memory_pools
 --> hsa_amd_agent_memory_pool_get_info
 --> hsa_amd_memory_pool_get_info
 --> hsa_amd_memory_pool_allocate

These are defined in hsa_ext_amd.h and hsa_ext_finalize.h, which are located in /opt/rocm/include/hsa.

You can use these AMD extended APIs to build and test your application.
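
The per-device selection step can be sketched as follows. This is a self-contained model that builds without ROCm, so `Segment`, `PoolInfo`, the flag constants, and `PickDeviceLocalPool` are hypothetical stand-ins rather than the real declarations. In a real program, the fields would presumably be filled inside the `hsa_amd_agent_iterate_memory_pools` callback by querying `hsa_amd_memory_pool_get_info` with attributes such as `HSA_AMD_MEMORY_POOL_INFO_SEGMENT`, `HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS`, `HSA_AMD_MEMORY_POOL_INFO_SIZE`, and `HSA_AMD_MEMORY_POOL_INFO_RUNTIME_ALLOC_ALLOWED`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in types for illustration only; the real ones live in hsa_ext_amd.h.
enum class Segment { Global, Group, Other };

// Hypothetical bit values; do not rely on them matching the HSA headers.
constexpr std::uint32_t kCoarseGrained = 1u << 0;
constexpr std::uint32_t kFineGrained = 1u << 1;

// One entry per memory pool reported for a single agent.
struct PoolInfo {
    Segment segment;
    std::uint32_t global_flags;
    std::size_t size;    // pool size in bytes
    bool alloc_allowed;  // whether runtime allocation is permitted
};

// Returns the index of the largest global, coarse-grained pool that allows
// allocation, or -1 if the agent reports no such pool.
int PickDeviceLocalPool(const std::vector<PoolInfo>& pools) {
    int best = -1;
    for (std::size_t i = 0; i < pools.size(); ++i) {
        const PoolInfo& p = pools[i];
        const bool candidate = p.segment == Segment::Global &&
                               (p.global_flags & kCoarseGrained) != 0 &&
                               p.alloc_allowed;
        if (!candidate) continue;
        if (best < 0 || p.size > pools[static_cast<std::size_t>(best)].size) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Running this selection once per GPU agent, then passing the chosen pool to `hsa_amd_memory_pool_allocate`, is one way to obtain device-local memory on every GPU rather than only the first.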

sanjtrip avatar May 18 '22 01:05 sanjtrip

@vchuravy, could you try the AMD extended APIs posted above to test it? @srinivamd

sanjtrip avatar May 20 '22 16:05 sanjtrip