HIP
Drastic difference in execution time between CUDA and HIP versions of the same code
Hi there,
I've tried running this code: https://www.particleincell.com/wp-content/uploads/2016/02/sheath-gpu.cu (reference: https://www.particleincell.com/2016/cuda-pic) on an AMD Radeon RX Vega 64 and an Nvidia 1080 Max-Q:
AMD
Found GPU 'Device 687f' with 8176 Gb of global memory, max 1024 threads per block, and 64
multiprocessors
TS:25 np_i:500000 np_e:500000 dphi:1.4
TS:50 np_i:500000 np_e:500000 dphi:4.17
TS:75 np_i:500000 np_e:500000 dphi:5.82
TS:100 np_i:500000 np_e:500000 dphi:5.02
TS:125 np_i:500000 np_e:500000 dphi:4.06
...
TS:9925 np_i:500000 np_e:500000 dphi:6.85
TS:9950 np_i:500000 np_e:500000 dphi:5.66
TS:9975 np_i:500000 np_e:500000 dphi:5.46
TS:10000 np_i:500000 np_e:500000 dphi:5.42
Time per time step: 155 ms
Nvidia
Found GPU 'GeForce GTX 1080 with Max-Q Design' with 8114.44 Gb of global memory, max 1024 threads
per block, and 20 multiprocessors
TS:25 np_i:500000 np_e:500000 dphi:1.38
TS:50 np_i:500000 np_e:500000 dphi:4.19
TS:75 np_i:500000 np_e:500000 dphi:5.93
TS:100 np_i:500000 np_e:500000 dphi:5.05
TS:125 np_i:500000 np_e:500000 dphi:4.04
...
TS:9925 np_i:500000 np_e:500000 dphi:6.85
TS:9950 np_i:500000 np_e:500000 dphi:5.77
TS:9975 np_i:500000 np_e:500000 dphi:5.41
TS:10000 np_i:500000 np_e:500000 dphi:5.4
Time per time step: 4.32 ms
Both versions, which are as close to identical as I could make them, can be found here: https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.hip.cpp https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.cuda.cpp
I should add that the RX Vega testing was done inside Docker, so the two setups are not exactly identical.
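To put the two "Time per time step" figures from the logs side by side, here is a small compile-time sketch of the arithmetic. The function and constant names are made up for illustration; the 155 ms, 4.32 ms and 10000-step values are taken from the output above.

```cpp
// Ratio between the slower and the faster per-step time.
constexpr double slowdown(double slow_ms, double fast_ms) {
    return slow_ms / fast_ms;
}

// Total wall-clock time in seconds for a run of `steps` time steps.
constexpr double total_seconds(double ms_per_step, int steps) {
    return ms_per_step * steps / 1000.0;
}

constexpr double kVegaMsPerStep = 155.0;  // RX Vega 64, from the log
constexpr double kGtxMsPerStep  = 4.32;   // GTX 1080 Max-Q, from the log
constexpr int    kSteps         = 10000;  // last reported TS value

// The HIP run is roughly 36x slower per step.
static_assert(slowdown(kVegaMsPerStep, kGtxMsPerStep) > 35.8 &&
              slowdown(kVegaMsPerStep, kGtxMsPerStep) < 35.9,
              "about 35.9x");

// About 1550 s (close to 26 minutes) versus about 43 s for the whole run.
static_assert(total_seconds(kVegaMsPerStep, kSteps) > 1549.9 &&
              total_seconds(kVegaMsPerStep, kSteps) < 1550.1, "about 1550 s");
static_assert(total_seconds(kGtxMsPerStep, kSteps) > 43.1 &&
              total_seconds(kGtxMsPerStep, kSteps) < 43.3, "about 43.2 s");
```

A gap of this size is much larger than a Docker layer would normally account for, which is why the per-kernel breakdown matters more than the overall number.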
RX Vega machine:
CPU: AMD Ryzen 9 3950X 16-Core Processor
Host OS: Debian GNU/Linux bullseye/sid
Kernel: Linux fad 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 GNU/Linux
Container OS: Ubuntu 20.04.2 LTS
1080 Max-Q machine:
CPU: Intel(R) Core(TM) i7-7820HK CPU @ 2.90GHz
Host OS: Ubuntu 20.04.2 LTS
Kernel: Linux feh 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
I'm happy to be told this is a problem with my setup, but if it's not, then maybe you'd like to know that too.
/opt/rocm/bin/hipconfig --full
HIP version : 4.0.20496-4f163c68
== hipconfig
HIP_PATH : /opt/rocm-4.0.0/hip
ROCM_PATH : /opt/rocm-4.0.0
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME : ROCclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -I/opt/rocm-4.0.0/hip/include -I/opt/rocm-4.0.0/llvm/bin/../lib/clang/12.0.0 -I/opt/rocm-4.0.0/hsa/include -D__HIP_ROCclr__
== hip-clang
HSA_PATH : /opt/rocm-4.0.0/hsa
HIP_CLANG_PATH : /opt/rocm-4.0.0/llvm/bin
clang version 12.0.0 (/src/external/llvm-project/clang dac2bfceaa8d4a90257dc8a6d58f268e172ce00e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.0.0/llvm/bin
LLVM (http://llvm.org/):
LLVM version 12.0.0git
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver2
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -D__HIP_ROCclr__ -std=c++11 -isystem /opt/rocm-4.0.0/llvm/lib/clang/12.0.0/include/.. -isystem /opt/rocm-4.0.0/hsa/include -D__HIP_ROCclr__ -isystem /opt/rocm-4.0.0/hip/include -O3
hip-clang-ldflags : -L/opt/rocm-4.0.0/hip/lib -O3 -lgcc_s -lgcc -lpthread -lm
=== Environment Variables
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
== Linux Kernel
Hostname : 4ac74b05b95d
Linux 4ac74b05b95d 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
Try 256 threads per block. In general you have to profile each kernel and report the slowest to the compiler.
> Try 256 threads per block
it's already configured to 256 threads per block https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.hip.cpp#L51
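For reference, this is roughly what a 256-thread block size implies for the particle counts in the logs. It is a sketch that assumes one thread per particle, which may not match how the linked code actually maps work to threads; `blocks_for` and the constants are hypothetical names.

```cpp
#include <cstddef>

// Ceiling division: the smallest number of blocks whose threads cover n_items.
constexpr std::size_t blocks_for(std::size_t n_items,
                                 std::size_t threads_per_block) {
    return (n_items + threads_per_block - 1) / threads_per_block;
}

constexpr std::size_t kParticles       = 500000;  // np_i / np_e in the logs
constexpr std::size_t kThreadsPerBlock = 256;     // value discussed above
constexpr std::size_t kBlocks = blocks_for(kParticles, kThreadsPerBlock);

// 1953 full blocks (499968 threads) plus one partial block for the last 32.
static_assert(kBlocks == 1954, "expected 1954 blocks");
static_assert(kBlocks * kThreadsPerBlock >= kParticles,
              "every particle is covered by a thread");
```

Under that assumption the launch is far larger than either GPU's multiprocessor count, so the block size alone seems unlikely to explain the gap.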
> In general you have to profile each kernel and report the slowest to the compiler.
can you expand on this please?
Use rocprofiler (--hip-trace option); it should produce JSON compatible with chrome://tracing. Then identify where the runtime/HW spends most of its time.
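If a profiler trace is not convenient, a rough host-side alternative is to accumulate wall-clock time per phase of the time step and look for the largest total. The sketch below is plain C++ and is not taken from the benchmark; `PhaseTimes`, `time_ms` and the phase names are hypothetical. It also assumes each timed callable blocks until the GPU has finished (for example by calling `hipDeviceSynchronize()` after the kernel launch), because otherwise only the asynchronous launch is measured.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulates wall-clock milliseconds per named phase of a time step.
struct PhaseTimes {
    std::map<std::string, double> total_ms;

    void add(const std::string& phase, double ms) {
        assert(ms >= 0.0 && "elapsed time cannot be negative");
        total_ms[phase] += ms;
    }

    // Returns the (name, total ms) pair with the largest accumulated time,
    // or ("", 0.0) if nothing has been recorded.
    std::pair<std::string, double> slowest() const {
        std::pair<std::string, double> best{"", 0.0};
        for (const auto& entry : total_ms) {
            if (entry.second > best.second) {
                best = {entry.first, entry.second};
            }
        }
        return best;
    }
};

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// For GPU work, `work` must not return until the device is idle.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In the main loop this would be used as something like `times.add("move_particles", time_ms([&] { /* launch kernel, then synchronize */ }));` for each phase, with `times.slowest()` reported at the end. The extra synchronization distorts the total run time, so it is only useful for finding which phase dominates.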
Thank you for the example. I could reproduce the performance issue. If you have access to an MI100 GPU, please give it a try.
@PhilipDeegan, sorry for the lack of response. Please try the latest ROCm 6.0.2 (HIP 6.0.32831) to see whether your issue still exists. If it is resolved, please close the ticket. Thanks.
I only have a gaming AMD GPU at the moment, so I can't really test this any more, but on my 6900XT with ROCm 6.0.2 I get:
Time per time step: 337 ms
@PhilipDeegan Thanks! Internal ticket has been created to investigate this issue.