very slow atomic operations over unified memory

Open zjin-lcf opened this issue 2 years ago • 5 comments

Running a HIP program that performs atomic operations on memory allocated with hipMallocManaged() is very slow on an MI-series GPU. If the HIP program is not written properly, please let me know. Thanks.

make source=main-um.cu
hipcc  -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 -c main-um.cu -o main-um.o
hipcc  -std=c++14 -Wall -I../atomicIntrinsics-cuda -O3 main-um.o -o main
./main 1
PASS
Average kernel execution time: 130903496.000000 (us)

https://github.com/zjin-lcf/HeCBench/blob/master/src/atomicIntrinsics-hip/main-um.cu

zjin-lcf avatar Mar 03 '23 22:03 zjin-lcf

Thanks for reporting it. Will look into it.

jatinx avatar Mar 15 '23 23:03 jatinx

@jatinx Did you have a chance to look into this? Thanks!

ppanchad-amd avatar Apr 11 '24 14:04 ppanchad-amd

Hi @zjin-lcf, looks like the file linked is no longer available.

Could you please provide another example so we can try to reproduce this issue internally?

harkgill-amd avatar Jul 18 '24 15:07 harkgill-amd

Sorry, the link is updated.

zjin-lcf avatar Jul 19 '24 15:07 zjin-lcf

@zjin-lcf that code has every single thread hammering on a small set of locations that span very few cache lines, so it should be expected to be "slow". However, we are introducing a compiler optimization that recognizes uniform addresses and reduces memory traffic. That will help this code, but it won't help the general case where the uniformity can't be deduced.
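For illustration, here is a minimal sketch of the access pattern described above, next to one manual way to cut traffic to a single address. It is not the code from the linked benchmark and not the compiler optimization itself. The kernel names `add_naive` and `add_aggregated`, the problem size, and the block size are all assumptions made for the example. The idea is that threads first combine their updates in shared memory (LDS), so each block issues only one atomic to the managed allocation.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Hypothetical kernel: every thread issues its own atomic on the same
// global address, so all updates contend for a single cache line.
__global__ void add_naive(int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(counter, 1);
}

// Hypothetical kernel: threads accumulate into a per-block shared-memory
// counter first, then one thread per block issues a single global atomic.
__global__ void add_aggregated(int *counter, int n) {
    __shared__ int block_sum;
    if (threadIdx.x == 0) block_sum = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&block_sum, 1);   // contention stays inside the block

    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(counter, block_sum);
}

int main() {
    const int n = 1 << 20;                 // assumed problem size
    const int threads = 256;               // assumed block size
    const int blocks = (n + threads - 1) / threads;

    // Two counters in managed (unified) memory, one per kernel.
    int *counter = nullptr;
    if (hipMallocManaged((void **)&counter, 2 * sizeof(int)) != hipSuccess) {
        fprintf(stderr, "hipMallocManaged failed\n");
        return 1;
    }
    counter[0] = 0;
    counter[1] = 0;

    add_naive<<<blocks, threads>>>(&counter[0], n);
    add_aggregated<<<blocks, threads>>>(&counter[1], n);
    if (hipDeviceSynchronize() != hipSuccess) {
        fprintf(stderr, "kernel execution failed\n");
        hipFree(counter);
        return 1;
    }

    // Both kernels should produce the same total; only the number of
    // global atomics differs (n versus one per block).
    bool ok = (counter[0] == n) && (counter[1] == n);
    printf("%s\n", ok ? "PASS" : "FAIL");

    hipFree(counter);
    return ok ? 0 : 1;
}
```

The sketch only checks that both variants give the same result. It makes no claim about timing, which depends on the GPU, the driver, and how the managed memory is placed. Aggregating by hand like this works only when the code knows that the threads target the same address, which matches the caveat about the general case.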

b-sumner avatar Jul 19 '24 16:07 b-sumner

Hi @zjin-lcf, please let us know if we can go ahead and close this ticket. Thanks!

ppanchad-amd avatar Sep 30 '24 17:09 ppanchad-amd