LIKWID.jl
`@nvmon`: Unreasonable metric values
using LIKWID
using CUDA
LIKWID.init_topology_gpu()
N = 10_000
a = 3.141f0 # Float32
x = CUDA.rand(Float32, N)
y = CUDA.rand(Float32, N)
z = CUDA.zeros(Float32, N)
saxpy!(z, a, x, y) = z .= a .* x .+ y
saxpy!(z, a, x, y); # warmup
metrics, events = @nvmon "FLOPS_SP" saxpy!(z, a, x, y);
gives
julia> metrics, events = @nvmon "FLOPS_SP" saxpy!(z, a, x, y);
Group: FLOPS_SP
┌────────────────────────────────────────────────────┬─────────┐
│ Event │ GPU 1 │
├────────────────────────────────────────────────────┼─────────┤
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM │ 0.0 │
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM │ 0.0 │
│ SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM │ 10000.0 │
└────────────────────────────────────────────────────┴─────────┘
┌─────────────────────┬────────────┐
│ Metric │ GPU 1 │
├─────────────────────┼────────────┤
│ Runtime (RDTSC) [s] │ 1.84467e10 │
│ SP [MFLOP/s] │ 1.0842e-12 │
└─────────────────────┴────────────┘
The metric values don't make sense: the reported runtime is 1.84467e10 s (about 585 years), and the SP rate is correspondingly tiny.
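A quick sanity check on the numbers above, as a sketch rather than a diagnosis. It assumes the `SP [MFLOP/s]` metric is computed as `1e-6 * (FADD + FMUL + 2 * FFMA) / runtime`; that formula is an assumption (it was not read from the group file), and `sp_mflops` is a helper name made up for this check. If the assumption holds, the odd rate simply follows from the odd runtime, and the runtime itself is suspiciously close to 2^64 nanoseconds, which would be consistent with an unsigned 64-bit time difference wrapping around. That last part is a guess and has not been verified against the LIKWID sources.

```julia
# Values copied from the @nvmon output above.
fadd, fmul, ffma = 0.0, 0.0, 10_000.0
runtime = 1.84467e10          # "Runtime (RDTSC) [s]"
reported_rate = 1.0842e-12    # "SP [MFLOP/s]"

# Assumed metric formula: an FMA counts as two floating-point operations.
# `sp_mflops` is a hypothetical helper, not part of LIKWID.jl.
sp_mflops(fadd, fmul, ffma, t) = 1e-6 * (fadd + fmul + 2 * ffma) / t

rate = sp_mflops(fadd, fmul, ffma, runtime)
println("recomputed SP [MFLOP/s]:  ", rate)
println("matches reported rate:    ", isapprox(rate, reported_rate; rtol=1e-4))

# 2^64 nanoseconds, expressed in seconds.
wrapped = 2.0^64 / 1e9
println("2^64 ns in seconds:       ", wrapped)
println("matches reported runtime: ", isapprox(runtime, wrapped; rtol=1e-5))
```

So the event counts look plausible and only the runtime (and everything derived from it) seems to be off.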
Running essentially the same thing under likwid-perfctr and using the GPU Marker API
# perfctr_gpu.jl
using LIKWID
using CUDA
const N = 10_000
const a = 3.141f0 # Float32
const x = CUDA.rand(Float32, N)
const y = CUDA.rand(Float32, N)
const z = CUDA.zeros(Float32, N)
saxpy!(z,a,x,y) = z .= a .* x .+ y
saxpy!(z,a,x,y) # warmup
GPUMarker.init()
GPUMarker.startregion("saxpy")
saxpy!(z,a,x,y)
GPUMarker.stopregion("saxpy")
GPUMarker.close()
gives
➜ bauerc@dgx-01 LIKWID.jl git:(cb/perfmonrev) likwid-perfctr -G 0 -W FLOPS_SP -m julia --project=. perfctr_gpu.jl
--------------------------------------------------------------------------------
CPU name: AMD EPYC 7742 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.25 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region saxpy, Group 1: FLOPS_SP
+-------------------+----------+
| Region Info | GPU 0 |
+-------------------+----------+
| RDTSC Runtime [s] | 0.005284 |
| call count | 1 |
+-------------------+----------+
+----------------------------------------------------+---------+-------+
| Event | Counter | GPU 0 |
+----------------------------------------------------+---------+-------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM | GPU0 | 0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM | GPU1 | 0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM | GPU2 | 5000 |
+----------------------------------------------------+---------+-------+
+---------------------+--------+
| Metric | GPU 0 |
+---------------------+--------+
| Runtime (RDTSC) [s] | 0.0053 |
| SP [MFLOP/s] | 1.8926 |
+---------------------+--------+
That seems much more reasonable. (But why only 5000 FFMA instructions here, when `@nvmon` counted 10000 for the same kernel?)
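To put the count discrepancy in numbers, here is a small sketch. It assumes the broadcast kernel issues exactly one FFMA per element, and that the metric is `1e-6 * (FADD + FMUL + 2 * FFMA) / runtime`; both are assumptions, and `sp_mflops` is a hypothetical helper used only for this check. It does not explain where the missing 5000 instructions went.

```julia
N = 10_000                    # vector length used in both scripts
ffma_nvmon = 10_000           # FFMA count reported by @nvmon
ffma_perfctr = 5_000          # FFMA count reported by likwid-perfctr
runtime_perfctr = 0.005284    # "RDTSC Runtime [s]" from likwid-perfctr
reported_rate = 1.8926        # "SP [MFLOP/s]" from likwid-perfctr

# Assumed metric formula: an FMA counts as two floating-point operations.
# `sp_mflops` is a hypothetical helper, not part of LIKWID.jl.
sp_mflops(fadd, fmul, ffma, t) = 1e-6 * (fadd + fmul + 2 * ffma) / t

println("FFMA per element (@nvmon):         ", ffma_nvmon / N)
println("FFMA per element (likwid-perfctr): ", ffma_perfctr / N)

# The reported rate is consistent with the 5000 counted FFMAs ...
rate = sp_mflops(0, 0, ffma_perfctr, runtime_perfctr)
println("recomputed SP [MFLOP/s]:           ", rate)
println("matches reported rate:             ", isapprox(rate, reported_rate; rtol=1e-3))

# ... and would be twice as high if all N FFMAs had been counted.
rate_full = sp_mflops(0, 0, N, runtime_perfctr)
println("SP [MFLOP/s] with N FFMAs:         ", rate_full)
```

Under these assumptions `@nvmon` sees one FFMA per element while likwid-perfctr sees only half of them, so the likwid-perfctr rate may be underestimated by a factor of two.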