`@nvmon`: Unreasonable metric values

Open · carstenbauer opened this issue 2 years ago · 1 comment

using LIKWID
using CUDA
LIKWID.init_topology_gpu()

N = 10_000
a = 3.141f0 # Float32
x = CUDA.rand(Float32, N)
y = CUDA.rand(Float32, N)
z = CUDA.zeros(Float32, N)

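# SAXPY as a fused broadcast; on the GPU this runs as a single kernel
saxpy!(z, a, x, y) = z .= a .* x .+ y
saxpy!(z, a, x, y); # warmup (triggers compilation so it is not measured)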

metrics, events = @nvmon "FLOPS_SP" saxpy!(z, a, x, y);

gives

julia> metrics, events = @nvmon "FLOPS_SP" saxpy!(z, a, x, y);

Group: FLOPS_SP
┌────────────────────────────────────────────────────┬─────────┐
│                                              Event │   GPU 1 │
├────────────────────────────────────────────────────┼─────────┤
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM │     0.0 │
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM │     0.0 │
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM │ 10000.0 │
└────────────────────────────────────────────────────┴─────────┘
┌─────────────────────┬────────────┐
│              Metric │      GPU 1 │
├─────────────────────┼────────────┤
│ Runtime (RDTSC) [s] │ 1.84467e10 │
│        SP [MFLOP/s] │ 1.0842e-12 │
└─────────────────────┴────────────┘

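The event counts look plausible (10000 FFMAs for N = 10_000), but the metric values don't make sense: the reported runtime is 1.84467e10 s (roughly 585 years), and the SP rate derived from it is essentially zero.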
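As a sanity check on these numbers, here is a minimal sketch that recomputes the metric from the table values. It makes two assumptions that are not confirmed anywhere in this issue: that the metric is computed as 1e-6 * (FADD + FMUL + 2 * FFMA) / runtime, and that the runtime might be a 64-bit counter value interpreted as nanoseconds. The helper name `sp_mflops` is made up for illustration.

```julia
# Values copied from the @nvmon tables above
fadd, fmul, ffma = 0.0, 0.0, 10_000.0
runtime_s = 1.84467e10

# Assumed metric formula (hypothetical helper): one FMA counts as two FLOPs
sp_mflops(fadd, fmul, ffma, t) = 1e-6 * (fadd + fmul + 2 * ffma) / t

# Recompute the reported "SP [MFLOP/s]" value from the events and the runtime
rate = sp_mflops(fadd, fmul, ffma, runtime_s)
println("recomputed SP [MFLOP/s]: ", rate)

# Hypothesis only: 2^64 nanoseconds expressed in seconds
wrapped_s = 2.0^64 / 1e9
println("2^64 ns in seconds:      ", wrapped_s)
println("close to reported runtime? ", isapprox(runtime_s, wrapped_s; rtol = 1e-5))
```

If the assumed formula is right, the tiny MFLOP/s value follows directly from the bogus runtime, so the runtime measurement would be the only thing that is actually off. That the runtime is so close to 2^64 / 1e9 hints at a wrapped or uninitialized timestamp, but that is a guess, not a diagnosis.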

carstenbauer · Jun 29 '22 11:06

Running essentially the same thing under likwid-perfctr and using the GPU Marker API

# perfctr_gpu.jl
using LIKWID
using CUDA

const N = 10_000
const a = 3.141f0 # Float32
const x = CUDA.rand(Float32, N)
const y = CUDA.rand(Float32, N)
const z = CUDA.zeros(Float32, N)

saxpy!(z,a,x,y) = z .= a .* x .+ y
saxpy!(z,a,x,y) # warmup

GPUMarker.init()
GPUMarker.startregion("saxpy")
saxpy!(z,a,x,y)
GPUMarker.stopregion("saxpy")
GPUMarker.close()

gives

➜  bauerc@dgx-01 LIKWID.jl git:(cb/perfmonrev)  likwid-perfctr -G 0 -W FLOPS_SP -m julia --project=. perfctr_gpu.jl
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 7742 64-Core Processor                
CPU type:       AMD K17 (Zen2) architecture
CPU clock:      2.25 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region saxpy, Group 1: FLOPS_SP
+-------------------+----------+
|    Region Info    |   GPU 0  |
+-------------------+----------+
| RDTSC Runtime [s] | 0.005284 |
|     call count    |        1 |
+-------------------+----------+

+----------------------------------------------------+---------+-------+
|                        Event                       | Counter | GPU 0 |
+----------------------------------------------------+---------+-------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM |   GPU0  |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM |   GPU1  |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM |   GPU2  |  5000 |
+----------------------------------------------------+---------+-------+

+---------------------+--------+
|        Metric       |  GPU 0 |
+---------------------+--------+
| Runtime (RDTSC) [s] | 0.0053 |
|     SP [MFLOP/s]    | 1.8926 |
+---------------------+--------+

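That seems much more reasonable: a runtime of about 5 ms and an SP rate that is consistent with it. (Why only 5000 FMAs though? With N = 10_000 I would expect 10000, which is what `@nvmon` reported.)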
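For reference, a small sketch that checks the likwid-perfctr numbers against each other. It assumes the metric is 1e-6 * (FADD + FMUL + 2 * FFMA) / runtime (not confirmed in this issue) and that saxpy needs one FMA per element, which is what the `@nvmon` event table suggests. The variable names are made up for illustration.

```julia
# Values copied from the likwid-perfctr tables above
N = 10_000
ffma_counted = 5000.0
runtime_s = 0.005284

# Assumed metric formula: FADD and FMUL are zero here, one FMA counts as two FLOPs
rate = 1e-6 * (2 * ffma_counted) / runtime_s
println("recomputed SP [MFLOP/s]: ", rate)   # reported: 1.8926

# Assumed expectation: one FMA per element, i.e. N FMAs in total
fraction = ffma_counted / N
println("fraction of expected FMAs counted: ", fraction)
```

Under these assumptions the reported rate matches the counted events up to rounding of the runtime, so the metric calculation looks fine here and only the event count is half of what I would expect.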

carstenbauer · Jun 29 '22 12:06