Very slow atomic operations over unified memory
Running a HIP program that performs atomic operations on memory allocated with hipMallocManaged() is very slow on an MI-series GPU. If the HIP program is not written correctly, please let me know. Thanks.
make source=main-um.cu
hipcc -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 -c main-um.cu -o main-um.o
hipcc -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 main-um.o -o main
./main 1
PASS
Average kernel execution time: 130903496.000000 (us)
https://github.com/zjin-lcf/HeCBench/blob/master/src/atomicIntrinsics-hip/main-um.cu
Thanks for reporting it. Will look into it.
@jatinx Did you have a chance to look into this? Thanks!
Hi @zjin-lcf, looks like the file linked is no longer available.
Could you please provide another example so we can try to reproduce this issue internally?
Sorry about that. The link has been updated.
@zjin-lcf in that code, every single thread hammers on a small set of locations that span very few cache lines. Atomics to the same location are serialized, so code doing that should be expected to be "slow". However, we are introducing a compiler optimization that recognizes uniform addresses and reduces memory traffic. That will help this code, but it won't help the general case where the uniformity can't be deduced.
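To illustrate why a uniform address matters, here is a minimal host-side sketch in plain C++. It is not HIP device code and not the compiler optimization itself. It is a single-threaded model that only counts how many atomic read-modify-write operations reach one shared location, first when every thread issues its own atomic, then when the lanes of a wavefront combine their contributions before issuing a single atomic. The names naive_add, aggregated_add, Result, and kWavefrontSize are hypothetical, and the 64-lane wavefront width is an assumption for this sketch.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

// Assumed wavefront width for this sketch (hypothetical constant).
constexpr std::size_t kWavefrontSize = 64;

struct Result {
    int value;               // final contents of the shared location
    std::size_t atomic_ops;  // atomic read-modify-write operations issued
};

// Naive pattern: every thread issues its own atomic add to the same address.
Result naive_add(std::size_t num_threads, int addend) {
    std::atomic<int> target{0};
    std::size_t ops = 0;
    for (std::size_t t = 0; t < num_threads; ++t) {
        target.fetch_add(addend, std::memory_order_relaxed);
        ++ops;
    }
    return {target.load(), ops};
}

// Aggregated pattern: the lanes of one wavefront sum their contributions
// locally, then a single atomic add is issued per wavefront.
Result aggregated_add(std::size_t num_threads, int addend) {
    std::atomic<int> target{0};
    std::size_t ops = 0;
    for (std::size_t base = 0; base < num_threads; base += kWavefrontSize) {
        const std::size_t lanes = std::min(kWavefrontSize, num_threads - base);
        int partial = 0;
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            partial += addend;  // local combine, no memory traffic in this model
        }
        target.fetch_add(partial, std::memory_order_relaxed);
        ++ops;
    }
    return {target.load(), ops};
}
```

Both functions produce the same final value, but in this model the aggregated version issues one atomic per wavefront instead of one per thread. This only works when all lanes are known to target the same address, which is why it does not carry over to the general case.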
Hi @zjin-lcf, please let us know if we can go ahead and close this ticket. Thanks!