Make python nvbench benchmarks interruptible
This PR adds a try/catch around the invocation of the Python function that defines a benchmark. If running the benchmark for some set of input parameters raises a Python exception, it is propagated as an nvbench::stop_iteration_loop exception, which is also added in this PR.
The nvbench::runner was modified to skip all outstanding benchmark configurations once an nvbench::stop_iteration_loop exception is caught.
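The control flow described above can be modeled with a small pure-Python sketch. This is not the C++ code from this PR: `StopIterationLoop`, `run_one`, `run_all`, and `fake_bench` are hypothetical names used only to illustrate how an exception raised by the user's function is translated into a stop request and how the remaining configurations are then skipped.

```python
class StopIterationLoop(Exception):
    """Hypothetical stand-in for the C++ nvbench::stop_iteration_loop."""


def run_one(bench_fn, config):
    """Call the user's benchmark function, translating any Python exception."""
    try:
        return bench_fn(config)
    except BaseException as exc:  # BaseException also covers KeyboardInterrupt
        raise StopIterationLoop(type(exc).__name__) from exc


def run_all(bench_fn, configs):
    """Run each configuration; after a stop request, skip the pending ones."""
    results = []
    stopped = False
    for config in configs:
        if stopped:
            results.append((config, "skip"))
            continue
        try:
            run_one(bench_fn, config)
            results.append((config, "pass"))
        except StopIterationLoop:
            results.append((config, "fail"))
            stopped = True
    return results


def fake_bench(config):
    """Toy benchmark that simulates Ctrl-C arriving during the third run."""
    if config == 3:
        raise KeyboardInterrupt


statuses = [status for _, status in run_all(fake_bench, range(1, 7))]
print(statuses)
```

With six configurations and an interrupt during the third, the model reports two passes, one failure, and three skips, which is the same pattern as the logs in this description.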
Terminal output of new behavior under Ctrl-C, without a blocking kernel
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$ python without_blocking_krn.py -d 0 -a "Stride=[1,2]" -a "ItemsPerThread=[1,2,4]"
# Devices
## [0] `NVIDIA RTX A6000`
* SM Version: 860 (PTX Version: 860)
* Number of SMs: 84
* SM Default Clock Rate: 1800 MHz
* Global Memory: 38847 MiB Free / 48530 MiB Total
* Global Memory Bus Peak: 768 GB/sec (384-bit DDR @8001MHz)
* Max Shared Memory: 100 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1536/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No
# Log
```
Run: [1/6] throughput_bench [Device=0 Stride=1 ItemsPerThread=1]
Pass: Cold: 0.459351ms GPU, 0.477780ms CPU, 0.51s total GPU, 0.57s total wall, 1104x
Run: [2/6] throughput_bench [Device=0 Stride=2 ItemsPerThread=1]
Pass: Cold: 0.615677ms GPU, 0.635224ms CPU, 0.81s total GPU, 0.90s total wall, 1312x
Run: [3/6] throughput_bench [Device=0 Stride=1 ItemsPerThread=2]
^CFail: Unexpected error: KeyboardInterrupt:
At:
/home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(62): launcher
/home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(68): throughput_bench
/home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(76):
Run: [4/6] throughput_bench [Device=0 Stride=2 ItemsPerThread=2]
Skip:
Run: [5/6] throughput_bench [Device=0 Stride=1 ItemsPerThread=4]
Skip:
Run: [6/6] throughput_bench [Device=0 Stride=2 ItemsPerThread=4]
Skip:
```
# Benchmark Results
## throughput_bench
### [0] NVIDIA RTX A6000
| Stride | ItemsPerThread | Elements | Datasize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|--------|----------------|----------|-------------|---------|------------|--------|------------|-------|---------|--------------|--------|
| 1 | 1 | 33554432 | 128.000 MiB | 1104x | 477.780 us | 12.29% | 459.351 us | 6.16% | 73.047G | 584.380 GB/s | 76.08% |
| 2 | 1 | 33554432 | 128.000 MiB | 1312x | 635.224 us | 9.54% | 615.677 us | 3.31% | 54.500G | 436.001 GB/s | 56.76% |
| 2 | 2 | | | | | | | | | | |
| 1 | 4 | | | | | | | | | | |
| 2 | 4 | | | | | | | | | | |
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$
One caveat: when Ctrl-C is pressed while running benchmarks that use the blocking kernel, the script is not interrupted right away.
The script continues executing until the blocking kernel times out and the cudaStreamSynchronize() call made by libnvbench completes. The user then sees the "Possible Deadlock Detected" warning, followed by the Python KeyboardInterrupt exception, and the remaining benchmarks are skipped.
Terminal output of new behavior under Ctrl-C, with a blocking kernel
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$ python with_blocking_krn.py -d 0 -a "Stride=[1,2]" -a "ItemsPerThread=[1,2,4]"
# Devices
## [0] `NVIDIA RTX A6000`
* SM Version: 860 (PTX Version: 860)
* Number of SMs: 84
* SM Default Clock Rate: 1800 MHz
* Global Memory: 38821 MiB Free / 48530 MiB Total
* Global Memory Bus Peak: 768 GB/sec (384-bit DDR @8001MHz)
* Max Shared Memory: 100 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1536/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No
# Log
```
Run: [1/6] throughput_bench [Device=0 Stride=1 ItemsPerThread=1]
Pass: Cold: 0.430871ms GPU, 0.470829ms CPU, 0.74s total GPU, 0.90s total wall, 1712x
Pass: Batch: 0.468832ms GPU, 0.80s total GPU, 0.85s total wall, 1713x
Run: [2/6] throughput_bench [Device=0 Stride=2 ItemsPerThread=1]
Pass: Cold: 0.588420ms GPU, 0.632322ms CPU, 0.51s total GPU, 0.60s total wall, 864x
Pass: Batch: 0.648875ms GPU, 0.56s total GPU, 0.57s total wall, 865x
Run: [3/6] throughput_bench [Device=0 Stride=1 ItemsPerThread=2]
Pass: Cold: 0.847039ms GPU, 0.892486ms CPU, 0.50s total GPU, 0.57s total wall, 592x
^C
######################################################################
##################### Possible Deadlock Detected #####################
######################################################################
Forcing unblock: The current measurement appears to have deadlocked
and the results cannot be trusted.
This happens when the KernelLauncher synchronizes the CUDA device.
If this is the case, pass the `sync` exec_tag to the `exec` call:
state.exec(); // Deadlock
state.exec(nvbench::exec_tag::sync, ); // Safe
This tells NVBench about the sync so it can run the benchmark safely.
If the KernelLauncher does not synchronize but has a very long
execution time, this may be a false positive. If so, disable this
check with:
state.set_blocking_kernel_timeout(-1);
The current timeout is set to 30 seconds.
For more information, see the 'Benchmarks that sync' section of the
NVBench documentation.
If this happens while profiling with an external tool,
pass the `--profile` flag to the executable to disable use of blocking kernel
and to also run the benchmark only once.
For more information, see the 'Benchmark Properties' section of the
NVBench documentation.
Fail: Unexpected error: KeyboardInterrupt:
At:
/home/opavlyk/repos/nvbench/python/examples/study/with_blocking_krn.py(62): launcher
/home/opavlyk/repos/nvbench/python/examples/study/with_blocking_krn.py(68): throughput_bench
/home/opavlyk/repos/nvbench/python/examples/study/with_blocking_krn.py(76):
Run: [4/6] throughput_bench [Device=0 Stride=2 ItemsPerThread=2]
Skip:
Run: [5/6] throughput_bench [Device=0 Stride=1 ItemsPerThread=4]
Skip:
Run: [6/6] throughput_bench [Device=0 Stride=2 ItemsPerThread=4]
Skip:
```
# Benchmark Results
## throughput_bench
### [0] NVIDIA RTX A6000
| Stride | ItemsPerThread | Elements | Datasize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
|--------|----------------|----------|-------------|---------|------------|-------|------------|-------|---------|--------------|--------|---------|------------|
| 1 | 1 | 33554432 | 128.000 MiB | 1712x | 470.829 us | 6.81% | 430.871 us | 0.59% | 77.876G | 623.007 GB/s | 81.11% | 1713x | 468.832 us |
| 2 | 1 | 33554432 | 128.000 MiB | 864x | 632.322 us | 4.91% | 588.420 us | 0.67% | 57.025G | 456.197 GB/s | 59.39% | 865x | 648.875 us |
| 2 | 2 | | | | | | | | | | | | |
| 1 | 4 | | | | | | | | | | | | |
| 2 | 4 | | | | | | | | | | | | |
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$
Closes #284