HIP Drastic difference in execution time between CUDA and HIP versions of the same code

Open PhilipDeegan opened this issue 3 years ago • 8 comments

Hi there,

I've tried running this: https://www.particleincell.com/wp-content/uploads/2016/02/sheath-gpu.cu (reference: https://www.particleincell.com/2016/cuda-pic) on an AMD Radeon RX Vega 64 and an Nvidia GTX 1080 Max-Q:

AMD

Found GPU 'Device 687f' with 8176 Gb of global memory, max 1024 threads per block, and 64 multiprocessors
TS:25  np_i:500000  np_e:500000  dphi:1.4
TS:50  np_i:500000  np_e:500000  dphi:4.17
TS:75  np_i:500000  np_e:500000  dphi:5.82
TS:100  np_i:500000  np_e:500000  dphi:5.02
TS:125  np_i:500000  np_e:500000  dphi:4.06

...

TS:9925  np_i:500000  np_e:500000  dphi:6.85
TS:9950  np_i:500000  np_e:500000  dphi:5.66
TS:9975  np_i:500000  np_e:500000  dphi:5.46
TS:10000  np_i:500000  np_e:500000  dphi:5.42
Time per time step: 155 ms

Nvidia

Found GPU 'GeForce GTX 1080 with Max-Q Design' with 8114.44 Gb of global memory, max 1024 threads per block, and 20 multiprocessors
TS:25  np_i:500000  np_e:500000  dphi:1.38
TS:50  np_i:500000  np_e:500000  dphi:4.19
TS:75  np_i:500000  np_e:500000  dphi:5.93
TS:100  np_i:500000  np_e:500000 dphi:5.05
TS:125  np_i:500000  np_e:500000  dphi:4.04

...

TS:9925  np_i:500000  np_e:500000  dphi:6.85
TS:9950  np_i:500000  np_e:500000  dphi:5.77
TS:9975  np_i:500000  np_e:500000  dphi:5.41
TS:10000  np_i:500000  np_e:500000  dphi:5.4
Time per time step: 4.32 ms

Both versions, which are as identical as I could make them, can be found here:
https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.hip.cpp
https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.cuda.cpp

Mar 09 '21 13:03 PhilipDeegan

I should add that the RX Vega testing was done inside Docker, so the two setups are not exactly identical.

RX Vega machine:
CPU: AMD Ryzen 9 3950X 16-Core Processor
Host OS: Debian GNU/Linux bullseye/sid
Kernel: Linux fad 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 GNU/Linux
Container OS: Ubuntu 20.04.2 LTS

1080 Max-Q machine:
CPU: Intel(R) Core(TM) i7-7820HK CPU @ 2.90GHz
Host OS: Ubuntu 20.04.2 LTS
Kernel: Linux feh 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I'm happy to be told this is a problem with my setup, but if it's not, then maybe you'd like to know that too.

/opt/rocm/bin/hipconfig --full

HIP version  : 4.0.20496-4f163c68

== hipconfig
HIP_PATH     : /opt/rocm-4.0.0/hip
ROCM_PATH    : /opt/rocm-4.0.0
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME  : ROCclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=  -I/opt/rocm-4.0.0/hip/include -I/opt/rocm-4.0.0/llvm/bin/../lib/clang/12.0.0 -I/opt/rocm-4.0.0/hsa/include -D__HIP_ROCclr__

== hip-clang
HSA_PATH         : /opt/rocm-4.0.0/hsa
HIP_CLANG_PATH   : /opt/rocm-4.0.0/llvm/bin
clang version 12.0.0 (/src/external/llvm-project/clang dac2bfceaa8d4a90257dc8a6d58f268e172ce00e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.0.0/llvm/bin
LLVM (http://llvm.org/):
  LLVM version 12.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver2

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -D__HIP_ROCclr__ -std=c++11 -isystem /opt/rocm-4.0.0/llvm/lib/clang/12.0.0/include/.. -isystem /opt/rocm-4.0.0/hsa/include -D__HIP_ROCclr__ -isystem /opt/rocm-4.0.0/hip/include -O3
hip-clang-ldflags  :  -L/opt/rocm-4.0.0/hip/lib -O3 -lgcc_s -lgcc -lpthread -lm

=== Environment Variables
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

== Linux Kernel
Hostname     : 4ac74b05b95d
Linux 4ac74b05b95d 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal

Mar 09 '21 13:03 PhilipDeegan

Try 256 threads per block. In general you have to profile each kernel and report the slowest to the compiler.

Mar 09 '21 14:03 gandryey

> Try 256 threads per block

It's already configured to use 256 threads per block: https://github.com/PhilipDeegan/PHARE/blob/gpu/tools/hw/gpu/bench/sheath-gpu.hip.cpp#L51

> In general you have to profile each kernel and report the slowest to the compiler.

Can you expand on this, please?

Mar 09 '21 14:03 PhilipDeegan

Use rocprofiler with the --hip-trace option; it should produce JSON compatible with chrome://tracing. Then identify where the runtime/HW spends most of the time.

Mar 09 '21 14:03 gandryey
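
Editor's sketch: besides loading the trace into chrome://tracing, the "identify where the time goes" step can be done with a small script. The following is a minimal sketch, not part of the original thread. It assumes the profiler output follows the Chrome Trace Event Format, i.e. either a bare list of events or an object with a "traceEvents" list, where complete events have "ph": "X" and a "dur" field in microseconds. The exact layout of the rocprofiler output may differ, and the function names load_events and summarize_trace are hypothetical.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load events from a Chrome-trace JSON file (object or bare-list form)."""
    with open(path) as f:
        data = json.load(f)
    # Assumption: the file is either {"traceEvents": [...]} or a bare list.
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_trace(events):
    """Return (name, call_count, total_us) tuples, largest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for ev in events:
        # Only complete events ("ph": "X") carry a duration; skip the rest.
        if ev.get("ph") != "X" or "dur" not in ev:
            continue
        entry = totals[ev.get("name", "<unnamed>")]
        entry[0] += 1
        entry[1] += float(ev["dur"])
    rows = [(name, count, total) for name, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Calling summarize_trace(load_events(path)) on the JSON file the profiler wrote gives one row per kernel or API call name; the first rows are the ones worth reporting or investigating.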

Thank you for the example. I could reproduce the performance issue. If you have access to an MI100 GPU, please give it a try.

Sep 13 '21 21:09 zjin-lcf

@PhilipDeegan, sorry for the lack of response. Please try the latest ROCm 6.0.2 (HIP 6.0.32831) to see whether your issue still exists. If it is resolved, please close the ticket. Thanks.

Mar 20 '24 15:03 ppanchad-amd

I only have a gaming AMD GPU at the moment, so I can't really test this any more.

But on my 6900XT with ROCm 6.0.2, it reports Time per time step: 337 ms

Mar 20 '24 19:03 PhilipDeegan

@PhilipDeegan Thanks! An internal ticket has been created to investigate this issue.

Apr 02 '24 14:04 ppanchad-amd
