HIP very slow atomic operations over unified memory

Running a HIP program that calls hipMallocManaged() is very slow on an MI-series GPU. If the HIP program is not written properly, please let me know. Thanks.

make source=main-um.cu
hipcc  -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 -c main-um.cu -o main-um.o
hipcc  -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 main-um.o -o main
./main 1
PASS
Average kernel execution time: 130903496.000000 (us)

https://github.com/zjin-lcf/HeCBench/blob/master/src/atomicIntrinsics-hip/main-um.cu

Mar 03 '23 22:03 zjin-lcf

Thanks for reporting it. Will look into it.

Mar 15 '23 23:03 jatinx

@jatinx Did you have a chance to look into this? Thanks!

Apr 11 '24 14:04 ppanchad-amd

Hi @zjin-lcf, looks like the file linked is no longer available.

Could you please provide another example so we can try to reproduce this issue internally?

Jul 18 '24 15:07 harkgill-amd

Sorry about that. The link has been updated.

Jul 19 '24 15:07 zjin-lcf

@zjin-lcf that code has every single thread hammering on a small set of locations in very few cache lines. Code doing that should be expected to be "slow". However, we are introducing a compiler optimization to recognize uniform addresses and reduce memory traffic. That will help this code, but won't help the general case where the uniformity can't be deduced.
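
To make the uniform-address point concrete, here is a minimal host-side sketch in plain C++ (not HIP, and not the actual compiler transformation). It models one 64-lane wavefront whose lanes all add to the same location: first with one atomic per lane, then with the lane values combined so that a single atomic is issued. The names `Result`, `per_lane_atomics`, and `aggregated_atomic` are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side model of one wavefront; this is not HIP code.
struct Result {
    int final_value;         // value left in the shared counter
    std::size_t atomic_ops;  // atomic read-modify-write operations issued
};

// Every lane issues its own atomic add to the same address.
Result per_lane_atomics(const std::vector<int>& lane_values) {
    std::atomic<int> counter{0};
    std::size_t ops = 0;
    for (int v : lane_values) {
        counter.fetch_add(v, std::memory_order_relaxed);
        ++ops;
    }
    return {counter.load(), ops};
}

// Lane values are combined first, then one atomic add is issued
// on behalf of the whole wavefront.
Result aggregated_atomic(const std::vector<int>& lane_values) {
    assert(!lane_values.empty());
    std::atomic<int> counter{0};
    int partial = std::accumulate(lane_values.begin(), lane_values.end(), 0);
    counter.fetch_add(partial, std::memory_order_relaxed);
    return {counter.load(), 1};
}
```

Both functions leave the same value in the counter, but the second touches the contended location once instead of once per lane. This only works when every lane is known to target the same address, which matches the caveat above about the general case where uniformity cannot be deduced.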

Jul 19 '24 16:07 b-sumner

Hi @zjin-lcf, please let us know if we can go ahead and close this ticket. Thanks!

Sep 30 '24 17:09 ppanchad-amd