HIP
Low bandwidth for a data conversion program in HIP on an AMD GPU
Running a SYCL program on an NVIDIA GPU and an AMD GPU shows that the bandwidth is significantly lower on the AMD GPU. The issue was therefore first reported at https://github.com/intel/llvm/issues/7195.
After converting the program from SYCL to HIP (https://github.com/zjin-lcf/HeCBench/tree/master/conversion-hip), I find that the bandwidth of the HIP program seems similar to that of the SYCL program. If the HIP program using hipMallocManaged() is not written properly, please let me know. Thanks.
@zjin-lcf Hi, were you able to resolve this issue on the latest HIP? If so can we close this ticket?
No. Thanks.
@zjin-lcf An internal ticket has been opened to help resolve this issue. Could we have your OS, GPU, CPU versions?
Ubuntu 22.04, MI100, AMD EPYC 7272 processor
@zjin-lcf : The HIP reproducer mentioned in your initial post is no longer available. Can you please link to a simple HIP reproducer so that we can understand the issue?
Sorry. The link is updated.
@zjin-lcf : Thanks for the updated link. I see that you are using hipMallocManaged without using any prefetch hints in your example. Can you share the output of rocminfo | grep amdgcn?
Also can you please set the environment variable AMD_LOG_LEVEL=4, rerun the reproducer and share the generated output trace?
In the reproducer, I see that the source and destination buffers are allocated using hipMallocManaged but are never populated with any data, probably because this is a synthetic test. In that case, can you see what happens if you replace hipMallocManaged with hipMalloc?
amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
After the replacement, I observe a significant bandwidth increase.
I replaced MallocManaged with Malloc in the GPU programs in the repository. I think the programs are supposed to measure the conversion time only.
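For readers comparing numbers before and after the replacement, here is a minimal sketch of how an effective bandwidth figure for a conversion kernel can be derived. It assumes the figure counts every source element read plus every destination element written, divided by the measured conversion time; that formula and the function name `conversion_bandwidth_gbs` are assumptions for illustration and have not been checked against the benchmark in the repository.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not taken from the benchmark): effective bandwidth in
// GB/s for converting n elements, assuming each element is read once at
// src_bytes and written once at dst_bytes.
double conversion_bandwidth_gbs(std::size_t n, std::size_t src_bytes,
                                std::size_t dst_bytes, double seconds) {
    assert(seconds > 0.0);  // elapsed time must be positive
    const double total_bytes =
        static_cast<double>(n) * static_cast<double>(src_bytes + dst_bytes);
    return total_bytes / seconds / 1e9;  // 1 GB = 1e9 bytes here
}
```

For example, `conversion_bandwidth_gbs(n, sizeof(float), sizeof(double), t)` would describe a float-to-double conversion of `n` elements that took `t` seconds.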
It seems that the impact of cudaMallocManaged on the execution time of the CUDA program is small. Thanks for your comments/suggestions.
@zjin-lcf : The target "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-" does not support managed memory. To keep the code functional, the HIP runtime falls back to allocating system memory that is made visible to all devices. That is why you see reduced performance with hipMallocManaged and better performance with hipMalloc.
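The target ID reported by rocminfo lists feature settings after the processor name, each written as a feature name followed by `+` (on) or `-` (off). Below is a minimal sketch of reading such a setting from the string. The helper `target_feature` and the enum `FeatureState` are hypothetical names, not part of HIP or ROCm. Treating the `xnack` setting as the relevant one for managed memory is an inference from the reply above, not something the sketch verifies.

```cpp
#include <sstream>
#include <string>

// Hypothetical names for illustration; not a HIP or ROCm API.
enum class FeatureState { On, Off, Unspecified };

// Looks up a feature such as "xnack" in a trimmed target ID string, e.g.
// "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-".
// Returns On for "feature+", Off for "feature-", and Unspecified when the
// feature does not appear in the string.
FeatureState target_feature(const std::string& target_id,
                            const std::string& feature) {
    std::istringstream fields(target_id);
    std::string field;
    std::getline(fields, field, ':');  // skip the part before the first ':'
    while (std::getline(fields, field, ':')) {
        // A matching field is exactly the feature name plus one sign character.
        if (field.size() == feature.size() + 1 &&
            field.compare(0, feature.size(), feature) == 0) {
            if (field.back() == '+') return FeatureState::On;
            if (field.back() == '-') return FeatureState::Off;
        }
    }
    return FeatureState::Unspecified;
}
```

With the string posted earlier in this thread, `target_feature(id, "xnack")` yields `FeatureState::Off` and `target_feature(id, "sramecc")` yields `FeatureState::On`.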
OK. Is there a link that explains the difference between managed memory and system memory?
System memory means host memory here. It is the same as memory allocated using hipHostMalloc.