
Low bandwidth for a data conversion program in HIP on an AMD GPU

Open zjin-lcf opened this issue 3 years ago • 4 comments

Running a SYCL program on an NVIDIA GPU and an AMD GPU shows that the bandwidth is significantly lower on the AMD GPU. I initially reported this at https://github.com/intel/llvm/issues/7195.

After converting the program from SYCL to HIP (https://github.com/zjin-lcf/HeCBench/tree/master/conversion-hip), I find that the bandwidth of the HIP program is similar to that of the SYCL program. If the HIP program, which uses hipMallocManaged(), is not written properly, please let me know. Thanks.

zjin-lcf avatar Oct 31 '22 11:10 zjin-lcf

@zjin-lcf Hi, were you able to resolve this issue on the latest HIP? If so can we close this ticket?

abhimeda avatar Feb 07 '24 19:02 abhimeda

No. Thanks.

zjin-lcf avatar Feb 07 '24 19:02 zjin-lcf

@zjin-lcf An internal ticket has been opened to help resolve this issue. Could we have your OS, GPU, CPU versions?

abhimeda avatar Feb 08 '24 16:02 abhimeda

Ubuntu 22.04, MI100, AMD EPYC 7272 processor

zjin-lcf avatar Feb 08 '24 17:02 zjin-lcf

@zjin-lcf : The HIP reproducer mentioned in your initial post is no longer available. Can you please link to a simple HIP reproducer so that we can understand the issue?

mangupta avatar May 31 '24 14:05 mangupta

Sorry. The link is updated.

zjin-lcf avatar May 31 '24 17:05 zjin-lcf

@zjin-lcf : Thanks for the updated link. I see that you are using hipMallocManaged without any prefetch hints in your example. Can you share the output of rocminfo | grep amdgcn? Also, can you please set the environment variable AMD_LOG_LEVEL=4, rerun the reproducer, and share the generated output trace? In the reproducer, I see that the source and destination buffers are allocated using hipMallocManaged but are not populated with any data, probably because this is a synthetic test. In that case, can you see what happens if you replace hipMallocManaged with hipMalloc?

mangupta avatar Jun 05 '24 05:06 mangupta

amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-

After the replacement, I observe a significant bandwidth increase.

I replaced MallocManaged with Malloc in the GPU programs in the repository. I think the programs are supposed to measure the conversion time only.

It seems that the impact of cudaMallocManaged on the execution time of the CUDA program is small. Thanks for your comments/suggestions.

zjin-lcf avatar Jun 05 '24 15:06 zjin-lcf

@zjin-lcf : "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-" does not support managed memory. To keep the code functional, the HIP runtime falls back to allocating system memory that is made visible to all devices. That is why you see reduced performance with hipMallocManaged and better performance with hipMalloc.
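As a rough illustration of reading the target ID quoted above, the sketch below checks for the `:xnack+` or `:xnack-` feature suffix in a string such as the one printed by rocminfo. The `Xnack` enum and the `xnack_setting` function are hypothetical names, not part of HIP or ROCm, and the sketch only inspects the string; it does not query the runtime.

```cpp
#include <string>

// Hypothetical enum: the XNACK state encoded in a target ID string.
enum class Xnack { On, Off, Unspecified };

// Hypothetical helper: looks for the ":xnack+" or ":xnack-" feature
// suffix in a target ID such as
// "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-".
Xnack xnack_setting(const std::string& target_id) {
    if (target_id.find(":xnack+") != std::string::npos) return Xnack::On;
    if (target_id.find(":xnack-") != std::string::npos) return Xnack::Off;
    return Xnack::Unspecified;  // feature not listed in the string
}
```

Called with the string from this thread, `xnack_setting` returns `Xnack::Off`, which matches the configuration described in the comment above.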

mangupta avatar Jun 05 '24 15:06 mangupta

OK. Is there a link that explains the difference between managed memory and system memory?

zjin-lcf avatar Jun 05 '24 17:06 zjin-lcf

system memory = host memory. It is the same as memory allocated using hipHostMalloc.

mangupta avatar Jun 06 '24 04:06 mangupta