miniBUDE
V2 fails to prevent invalid wgsizes from launching
If we try to launch the benchmark with a non-existent kernel WGSIZE, the program produces an invalid result instead of reporting the problem and terminating early:
miniBUDE:
  compile_commands:
    - "/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/bin/nvcc -forward-unknown-to-host-compiler -DCUDA -DMEM=MANAGED -DUSE_PPWI="1\\,2\\,4\\,8\\,16\\,32\\,64\\,128" --options-file <OUT>/includes_CUDA.rsp -std=c++17 -forward-unknown-to-host-compiler -arch=sm_61 -use_fast_math -restrict -keep -DNDEBUG -std=c++17 -O3 -march=native -x cu -c <SRC>/main.cpp -o <OUT>/src/main.cpp.o"
  vcs:
    commit: e7339d6cd9b832f0ba59ed73d2bc406e4345d495*
    author: "Tom Lin ([email protected])"
    date: "2023-10-02 15:21:22 +0100"
    subject: "Prevent NVHPC from optimising away task barrier (likely a bug)"
  host_cpu:
    ~
  time: { epoch_s:1698373309, formatted: "Fri Oct 27 02:21:49 2023 GMT" }
  deck:
    path: "../data/bm1"
    poses: 65536
    proteins: 938
    ligands: 26
    forcefields: 34
  config:
    iterations: 8
    poses: 65536
    ppwi:
      available: [1,2,4,8,16,32,64,128]
      selected: [64]
    wgsize: [512]
  device: { index: 0, name: "NVIDIA TITAN X (Pascal) (12189MB;sm_61)" }
# Device and kernel cc: sm_61
# Verification failed for ppwi=64, wgsize=512; difference exceeded tolerance (0.025%)
# Bad energies (failed/total=58671/65536, showing first 8):
# index,actual,expected,difference_%
# 0,0,865.523,100
# 1,0,25.0715,100
# 2,0,368.434,100
# 3,0,14.6651,100
# 4,0,574.987,100
# 5,0,707.354,100
# 6,0,33.947,100
# 7,0,135.588,100
# (ppwi=64,wgsize=512,valid=0)
  results:
    - outcome: { valid: false, max_diff_%: 100.000 }
      param: { ppwi: 64, wgsize: 512 }
      raw_iterations: [3.50847,0.00114,0.00047,0.00039,0.00041,0.00038,0.00036,0.00037,0.00034,0.00039]
      context_ms: 0.635100
      sum_ms: 0.003
      avg_ms: 0.000
      min_ms: 0.000
      max_ms: 0.000
      stddev_ms: 0.000
      giga_interactions/s: 4111361.976
      gflop/s: 124067012.898
      gfinst/s: 102784049.389
      energies:
        - 0.00
        - 0.00
        - 0.00
        - 0.00
        - 0.00
        - 0.00
        - 0.00
        - 0.00
  best: { min_ms: 0.00, max_ms: 0.00, sum_ms: 0.00, avg_ms: 0.00, ppwi: 64, wgsize: 512 }
We also need to add a hint to the error message so that the missing WGSIZE can be added. Thanks to @jhdavis8 for discovering this.
Update: it's CUDA's wgsize (which propagates to threads per block) that's failing; PPWI is the one that's defined at compile time.
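One possible shape for the early check is sketched below in plain C++. This is not miniBUDE's actual code: `check_wgsize` and its parameters are hypothetical names, and the sketch assumes the per-kernel thread limit has already been queried from the runtime (for CUDA, something like the `maxThreadsPerBlock` field filled in by `cudaFuncGetAttributes`). The limit values used in the usage note are illustrative only, not measurements from the run above.

```cpp
#include <optional>
#include <string>

// Hypothetical helper, not part of miniBUDE.
// Returns an error message (including a hint) when the requested work-group
// size cannot be launched, or std::nullopt when the launch may proceed.
// `max_threads_per_block` is assumed to have been queried for the specific
// kernel instantiation (i.e. for the selected ppwi) before calling this.
std::optional<std::string> check_wgsize(int wgsize, int ppwi, int max_threads_per_block) {
  if (wgsize <= 0) {
    return "wgsize must be positive, got " + std::to_string(wgsize);
  }
  if (wgsize > max_threads_per_block) {
    return "wgsize=" + std::to_string(wgsize) + " cannot be launched for ppwi=" +
           std::to_string(ppwi) + " (kernel limit is " +
           std::to_string(max_threads_per_block) +
           " threads per block); choose a wgsize at or below this limit";
  }
  return std::nullopt;  // wgsize is within the limit
}
```

A caller would run this before the timed iterations and terminate with the returned message instead of benchmarking: for example, `check_wgsize(512, 64, 256)` yields a message, while `check_wgsize(256, 64, 256)` does not. Checking the launch status after the kernel call (for CUDA, via `cudaGetLastError`) would be a complementary safeguard for failures that a static limit does not predict.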