
Drastic difference in execution time between CUDA and HIP versions of the same code

Open PhilipDeegan opened this issue 3 years ago • 8 comments

Hi there,

I've tried running this code: https://www.particleincell.com/wp-content/uploads/2016/02/sheath-gpu.cu (reference: https://www.particleincell.com/2016/cuda-pic) on an AMD Radeon RX Vega 64 and an Nvidia 1080 Max-Q:

AMD

Found GPU 'Device 687f' with 8176 Gb of global memory, max 1024 threads per block, and 64
multiprocessors
TS:25  np_i:500000  np_e:500000  dphi:1.4
TS:50  np_i:500000  np_e:500000  dphi:4.17
TS:75  np_i:500000  np_e:500000  dphi:5.82
TS:100  np_i:500000  np_e:500000  dphi:5.02
TS:125  np_i:500000  np_e:500000  dphi:4.06

...

TS:9925  np_i:500000  np_e:500000  dphi:6.85
TS:9950  np_i:500000  np_e:500000  dphi:5.66
TS:9975  np_i:500000  np_e:500000  dphi:5.46
TS:10000  np_i:500000  np_e:500000  dphi:5.42
Time per time step: 155 ms

Nvidia

Found GPU 'GeForce GTX 1080 with Max-Q Design' with 8114.44 Gb of global memory, max 1024 threads
per block, and 20 multiprocessors
TS:25  np_i:500000  np_e:500000  dphi:1.38
TS:50  np_i:500000  np_e:500000  dphi:4.19
TS:75  np_i:500000  np_e:500000  dphi:5.93
TS:100  np_i:500000  np_e:500000 dphi:5.05
TS:125  np_i:500000  np_e:500000  dphi:4.04

...

TS:9925  np_i:500000  np_e:500000  dphi:6.85
TS:9950  np_i:500000  np_e:500000  dphi:5.77
TS:9975  np_i:500000  np_e:500000  dphi:5.41
TS:10000  np_i:500000  np_e:500000  dphi:5.4
Time per time step: 4.32 ms

Both versions, which are as close to identical as I could make them, can be found here:
https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.hip.cpp
https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.cuda.cpp

PhilipDeegan avatar Mar 09 '21 13:03 PhilipDeegan

I should add that the RX Vega testing was done inside Docker, so the two setups are not exactly identical.

RX Vega machine:
CPU: AMD Ryzen 9 3950X 16-Core Processor
Host OS: Debian GNU/Linux bullseye/sid
Kernel: Linux fad 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 GNU/Linux
Container OS: Ubuntu 20.04.2 LTS

1080 Max-Q machine:
CPU: Intel(R) Core(TM) i7-7820HK CPU @ 2.90GHz
Host OS: Ubuntu 20.04.2 LTS
Kernel: Linux feh 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I'm happy to be told this is a problem with my setup, but if it's not, then maybe you'd like to know that too.

/opt/rocm/bin/hipconfig --full

HIP version  : 4.0.20496-4f163c68

== hipconfig
HIP_PATH     : /opt/rocm-4.0.0/hip
ROCM_PATH    : /opt/rocm-4.0.0
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME  : ROCclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=  -I/opt/rocm-4.0.0/hip/include -I/opt/rocm-4.0.0/llvm/bin/../lib/clang/12.0.0 -I/opt/rocm-4.0.0/hsa/include -D__HIP_ROCclr__

== hip-clang
HSA_PATH         : /opt/rocm-4.0.0/hsa
HIP_CLANG_PATH   : /opt/rocm-4.0.0/llvm/bin
clang version 12.0.0 (/src/external/llvm-project/clang dac2bfceaa8d4a90257dc8a6d58f268e172ce00e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.0.0/llvm/bin
LLVM (http://llvm.org/):
  LLVM version 12.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver2

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -D__HIP_ROCclr__ -std=c++11 -isystem /opt/rocm-4.0.0/llvm/lib/clang/12.0.0/include/.. -isystem /opt/rocm-4.0.0/hsa/include -D__HIP_ROCclr__ -isystem /opt/rocm-4.0.0/hip/include -O3
hip-clang-ldflags  :  -L/opt/rocm-4.0.0/hip/lib -O3 -lgcc_s -lgcc -lpthread -lm

=== Environment Variables
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

== Linux Kernel
Hostname     : 4ac74b05b95d
Linux 4ac74b05b95d 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal

PhilipDeegan avatar Mar 09 '21 13:03 PhilipDeegan

Try 256 threads per block. In general you have to profile each kernel and report the slowest to the compiler.

gandryey avatar Mar 09 '21 14:03 gandryey

> Try 256 threads per block

It's already configured for 256 threads per block: https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.hip.cpp#L51

> In general you have to profile each kernel and report the slowest to the compiler.

Can you expand on this, please?

PhilipDeegan avatar Mar 09 '21 14:03 PhilipDeegan

Use rocprofiler (--hip-trace option); it should produce JSON compatible with chrome://tracing. Then identify where the runtime/HW spends most of its time.
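
As a rough illustration of what "profile each kernel" can look like without a profiler, here is a minimal host-side sketch, not taken from the thread or from the benchmark. `StageTimer` and the stage names are hypothetical. It accumulates wall-clock time per named stage and reports the slowest one. It assumes each timed callable blocks until the GPU has finished (for example a kernel launch followed by `hipDeviceSynchronize()`); otherwise only the launch overhead is measured.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper (not part of the benchmark): accumulates
// wall-clock milliseconds per named stage of a time step.
class StageTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Run `work` and add its elapsed wall-clock time to the total for `name`.
    // Assumption: for GPU kernels, `work` blocks until the device is done,
    // e.g. it launches the kernel and then calls hipDeviceSynchronize().
    template <typename Work>
    void measure(const std::string& name, Work&& work) {
        const auto start = Clock::now();
        work();
        const auto stop = Clock::now();
        add(name, std::chrono::duration<double, std::milli>(stop - start).count());
    }

    // Add a duration measured elsewhere, e.g. the value returned
    // by hipEventElapsedTime for a pair of recorded events.
    void add(const std::string& name, double ms) {
        assert(ms >= 0.0);
        totals_ms_[name] += ms;
    }

    // Total milliseconds recorded for `name` (0 if never recorded).
    double total_ms(const std::string& name) const {
        const auto it = totals_ms_.find(name);
        return it == totals_ms_.end() ? 0.0 : it->second;
    }

    // Name of the stage with the largest accumulated time ("" if empty).
    std::string slowest() const {
        std::string worst;
        double worst_ms = -1.0;
        for (const auto& entry : totals_ms_) {
            if (entry.second > worst_ms) {
                worst_ms = entry.second;
                worst = entry.first;
            }
        }
        return worst;
    }

private:
    std::map<std::string, double> totals_ms_;
};
```

In a time-step loop one would wrap each kernel launch, e.g. `timer.measure("scatter", [&] { /* launch + synchronize */ });`, and print `timer.slowest()` at the end. A trace from the profiler gives a finer-grained picture than this, since it also shows memory copies and runtime API calls.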

gandryey avatar Mar 09 '21 14:03 gandryey

Thank you for the example. I could reproduce the performance issue. If you have access to an MI100 GPU, please give it a try there.

zjin-lcf avatar Sep 13 '21 21:09 zjin-lcf

@PhilipDeegan, sorry for the lack of response. Please try the latest ROCm 6.0.2 (HIP 6.0.32831) to see if your issue still exists. If it is resolved, please close the ticket. Thanks.

ppanchad-amd avatar Mar 20 '24 15:03 ppanchad-amd

I only have a gaming AMD GPU at the moment, so I can't really test this any more,

but on my 6900XT with ROCm 6.0.2 it reports: Time per time step: 337 ms

PhilipDeegan avatar Mar 20 '24 19:03 PhilipDeegan

@PhilipDeegan Thanks! An internal ticket has been created to investigate this issue.

ppanchad-amd avatar Apr 02 '24 14:04 ppanchad-amd