Low bandwidth for a data conversion program in HIP on an AMD GPU


Running a SYCL program on an NVIDIA GPU and on an AMD GPU shows that the bandwidth is significantly lower on the AMD GPU. I initially reported this at https://github.com/intel/llvm/issues/7195.

After converting the program from SYCL to HIP (https://github.com/zjin-lcf/HeCBench/tree/master/conversion-hip), I find that the bandwidth of the HIP program is similar to that of the SYCL program. If the HIP program, which uses hipMallocManaged(), is not written correctly, please let me know. Thanks.

Oct 31 '22 11:10 zjin-lcf

@zjin-lcf Hi, were you able to resolve this issue on the latest HIP? If so, can we close this ticket?

Feb 07 '24 19:02 abhimeda

No. Thanks.

Feb 07 '24 19:02 zjin-lcf

@zjin-lcf An internal ticket has been opened to help resolve this issue. Could you share your OS, GPU, and CPU versions?

Feb 08 '24 16:02 abhimeda

Ubuntu 22.04, MI100, AMD EPYC 7272 processor

Feb 08 '24 17:02 zjin-lcf

@zjin-lcf : The HIP reproducer mentioned in your initial post is no longer available. Can you please link to a simple HIP reproducer so that we can understand the issue?

May 31 '24 14:05 mangupta

Sorry. The link is updated.

May 31 '24 17:05 zjin-lcf

@zjin-lcf : Thanks for the updated link. I see that you are using hipMallocManaged without any prefetch hints in your example. Can you share the output of rocminfo | grep amdgcn? Also, can you please set the environment variable AMD_LOG_LEVEL=4, rerun the reproducer, and share the generated output trace?

In the reproducer, I see that the source and destination buffers are allocated using hipMallocManaged but are not populated with any data, probably because this is a synthetic test. In that case, can you see what happens if you replace hipMallocManaged with hipMalloc?

Jun 05 '24 05:06 mangupta

amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-

After the replacement, I observe a significant bandwidth increase.

I replaced MallocManaged with Malloc in the GPU programs in the repository, since I think the programs are supposed to measure the conversion time only.

It seems that the impact of cudaMallocManaged on the execution time of the CUDA program is small. Thanks for your comments/suggestions.

Jun 05 '24 15:06 zjin-lcf
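For readers comparing the bandwidth figures discussed above: a conversion kernel reads each source element once and writes each destination element once, so one common way to report effective bandwidth is total bytes moved divided by kernel time. The sketch below assumes that definition. It is not taken from the linked benchmark, and the function and parameter names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: effective bandwidth in GB/s for a kernel that reads
// `n` source elements of `src_bytes` each and writes `n` destination
// elements of `dst_bytes` each in `seconds` of kernel time.
// Assumes every element is read once and written once (no caching effects).
double effective_bandwidth_gbs(std::size_t n,
                               std::size_t src_bytes,
                               std::size_t dst_bytes,
                               double seconds) {
    const double total_bytes =
        static_cast<double>(n) * static_cast<double>(src_bytes + dst_bytes);
    return total_bytes / seconds / 1e9;  // 1 GB = 1e9 bytes
}
```

When timing only the conversion, as intended here, the measured interval should cover the kernel alone. With managed or host-resident buffers, the time spent moving data to the device is included in the kernel time and lowers this number.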

@zjin-lcf : "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-" does not support managed memory. To keep the code functional, the HIP runtime falls back to allocating system memory that is made visible to all devices. That is why you see reduced performance with hipMallocManaged and better performance with hipMalloc.

Jun 05 '24 15:06 mangupta
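The target ID quoted above encodes feature states as colon-separated suffixes ending in + or -. As a minimal sketch, a line of rocminfo output could be checked for the xnack state as follows. The helper name is hypothetical and is not part of HIP or ROCm.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns '+' or '-' for the state of `feature`
// (e.g. "xnack") in a target ID such as
// "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-", or '?' if the feature
// is not listed in the string.
char feature_state(const std::string& target_id, const std::string& feature) {
    const std::string key = ":" + feature;
    std::size_t pos = target_id.find(key);
    while (pos != std::string::npos) {
        const std::size_t sign = pos + key.size();
        if (sign < target_id.size() &&
            (target_id[sign] == '+' || target_id[sign] == '-')) {
            // The sign must end the string or be followed by the next feature.
            if (sign + 1 == target_id.size() || target_id[sign + 1] == ':') {
                return target_id[sign];
            }
        }
        pos = target_id.find(key, pos + 1);
    }
    return '?';
}
```

For the string reported in this thread, feature_state(id, "xnack") yields '-', which is the state described above as lacking managed memory support.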

OK. Is there a link that explains the difference between managed memory and system memory?

Jun 05 '24 17:06 zjin-lcf

System memory = host memory. It is the same as memory allocated using hipHostMalloc.

Jun 06 '24 04:06 mangupta
