pynvbench: Need to repeatedly Ctrl-C to stop pynvbench script

Open · shwina opened this issue 3 months ago · 1 comment

When I run a pynvbench script with multiple benchmarks and hit Ctrl-C, it terminates only the current benchmark and moves on to the next one, instead of terminating execution altogether.

This looks like:

Run:  [17/20] benchmark_name [Device=1 dtype=float32 Timesteps=2^24]
^CFail: Unexpected error: KeyboardInterrupt: <EMPTY MESSAGE>

At:
  /path/to/benchmark_script.py(399): benchmark_function
  /path/to/benchmark_script.py(427): <module>

Run:  [18/20] benchmark_name [Device=1 dtype=float64 Timesteps=2^24]
^C^CFail: Unexpected error: KeyboardInterrupt: <EMPTY MESSAGE>

At:
  /path/to/benchmark_script.py(399): benchmark_function
  /path/to/benchmark_script.py(427): <module>

Run:  [19/20] benchmark_name [Device=1 dtype=float32 Timesteps=2^28]
^C^C^C^C^C^C^CFail: Unexpected error: KeyboardInterrupt: <EMPTY MESSAGE>

At:
  /path/to/benchmark_script.py(399): benchmark_function
  /path/to/benchmark_script.py(427): <module>

I suspect this is due to a broad catch (C++) or except (Python) statement somewhere that catches and handles all exceptions, including KeyboardInterrupt, instead of letting the Python interpreter deal with it.
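To illustrate the suspicion above, here is a minimal pure-Python sketch. It is not pynvbench code: `run_all`, `interrupted`, and `ok` are hypothetical names, and the interrupt is simulated by raising KeyboardInterrupt directly. It only shows that a handler broad enough to cover BaseException turns an interrupt into a logged failure and keeps going, while a handler limited to Exception lets it escape.

```python
def run_all(benchmarks, handler=BaseException):
    """Run each callable; log failures matching `handler` and continue."""
    log = []
    for name, fn in benchmarks:
        try:
            fn()
            log.append((name, "pass"))
        except handler as exc:
            log.append((name, f"fail: {type(exc).__name__}"))
    return log


def interrupted():
    # Stand-in for the user pressing Ctrl-C while a benchmark runs.
    raise KeyboardInterrupt


def ok():
    return None


benchmarks = [("first", interrupted), ("second", ok)]

# Broad handler: the interrupt is logged as a failure and the loop continues.
broad_log = run_all(benchmarks, handler=BaseException)

# Narrow handler: KeyboardInterrupt is not a subclass of Exception, so it
# propagates out of run_all and stops the remaining benchmarks.
try:
    run_all(benchmarks, handler=Exception)
    escaped = False
except KeyboardInterrupt:
    escaped = True

print(broad_log, escaped)
```

The broad variant matches the "Fail: Unexpected error: KeyboardInterrupt" followed by the next "Run:" line seen in the log. Whether the actual handler lives on the C++ or the Python side is not established here.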

Nov 12 '25 10:11 shwina

There appear to be several issues.

A pybind11 reproducer without involvement of CUDA seems to work fine:

Stand-alone reproducer

(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/tmp$ cat a.cpp 
#include "pybind11/pybind11.h"

#include <cstddef>

namespace py = pybind11;

PYBIND11_MODULE(repr, m) {
    m.def("loop", [](py::object fn, py::object arg, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            fn(arg);
        }
        py::print("Done!");
    });
}
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/tmp$ cat run.py 
import repr

def inner(arg):
    res = []
    for _ in range(10000):
        res.append(arg)
    return res

x = [42]
repr.loop(inner, x, 2**33)

print("Done!!!!")

(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/tmp$ g++ a.cpp -shared -fPIC $(python -m pybind11 --includes) -o repr$(python3-config --extension-suffix)
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/tmp$ 
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/tmp$ python run.py 
^CTraceback (most recent call last):
  File "/home/opavlyk/repos/nvbench/python/tmp/run.py", line 10, in <module>
    repr.loop(inner, x, 2**33)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/opavlyk/repos/nvbench/python/tmp/run.py", line 6, in inner
    res.append(arg)
    ~~~~~~~~~~^^^^^
KeyboardInterrupt
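The same check can be scripted without pressing Ctrl-C by hand. The sketch below is an assumption-laden stand-in: `loop` here is a pure-Python substitute for the C++ binding in a.cpp, and the interrupt is delivered with `signal.raise_signal` (Python 3.8+) from inside the callback. It assumes it runs in the main thread with the default SIGINT handler installed.

```python
import signal


def loop(fn, arg, n):
    """Pure-Python stand-in for the C++ `loop` binding: call fn(arg) n times."""
    for _ in range(n):
        fn(arg)
    return "Done!"


calls = 0


def inner(arg):
    """Callback that simulates Ctrl-C on its third invocation."""
    global calls
    calls += 1
    if calls == 3:
        # Deliver SIGINT to this process, as the terminal would on Ctrl-C.
        signal.raise_signal(signal.SIGINT)
    return [arg]


try:
    outcome = loop(inner, 42, 10)
except KeyboardInterrupt:
    # The interrupt unwound the whole driver loop, not just one callback.
    outcome = "interrupted"

print(outcome, calls)
```

Swapping the stand-in for the compiled `repr.loop` would exercise the pybind11 path itself; that variant has not been run here.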

Use of the blocking kernel seems to cause some problems. Pressing Ctrl-C in the middle of a benchmark that uses the blocking kernel causes a time-out message ("Possible Deadlock Detected") to appear.

However, if the same benchmark is executed with state.exec(launcher, sync=True), which disables the blocking kernel, an interrupt skips to the next state in the set of benchmarks:

Behavior of "examples/throughput.py" on interrupts with and without blocking kernel

(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$ diff with_blocking_krn.py ../throughput.py 
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$ diff without_blocking_krn.py ../throughput.py 
68c68
<     state.exec(launcher, sync=True)
---
>     state.exec(launcher)

(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$ python without_blocking_krn.py -d 0 --min-samples 10000 -a "Stride=1" -a "ItemsPerThread=[1,2]"
# Devices

## [0] `NVIDIA RTX A6000`
* SM Version: 860 (PTX Version: 860)
* Number of SMs: 84
* SM Default Clock Rate: 1800 MHz
* Global Memory: 40827 MiB Free / 48530 MiB Total
* Global Memory Bus Peak: 768 GB/sec (384-bit DDR @8001MHz)
* Max Shared Memory: 100 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1536/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

```
Run:  [1/2] throughput_bench [Device=0 Stride=1 ItemsPerThread=1]
^CFail: Unexpected error: KeyboardInterrupt: 

At:
  /home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(62): launcher
  /home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(68): throughput_bench
  /home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(76): <module>

Run:  [2/2] throughput_bench [Device=0 Stride=1 ItemsPerThread=2]
^CFail: Unexpected error: KeyboardInterrupt: 

At:
  /home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(62): launcher
  /home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(68): throughput_bench
  /home/opavlyk/repos/nvbench/python/examples/study/without_blocking_krn.py(76): <module>

```

# Benchmark Results

## throughput_bench

### [0] NVIDIA RTX A6000

No data -- check log.

(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$ python with_blocking_krn.py -d 0 --min-samples 10000 -a "Stride=1" -a "ItemsPerThread=[1,2]"
# Devices

## [0] `NVIDIA RTX A6000`
* SM Version: 860 (PTX Version: 860)
* Number of SMs: 84
* SM Default Clock Rate: 1800 MHz
* Global Memory: 40827 MiB Free / 48530 MiB Total
* Global Memory Bus Peak: 768 GB/sec (384-bit DDR @8001MHz)
* Max Shared Memory: 100 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1536/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

```
Run:  [1/2] throughput_bench [Device=0 Stride=1 ItemsPerThread=1]
Pass: Cold: 0.434290ms GPU, 0.477892ms CPU, 4.34s total GPU, 5.34s total wall, 10001x 
^C^C^C^C^C^C
######################################################################
##################### Possible Deadlock Detected #####################
######################################################################

Forcing unblock: The current measurement appears to have deadlocked
and the results cannot be trusted.

This happens when the KernelLauncher synchronizes the CUDA device.
If this is the case, pass the `sync` exec_tag to the `exec` call:

    state.exec(<KernelLauncher>); // Deadlock
    state.exec(nvbench::exec_tag::sync, <KernelLauncher>); // Safe

This tells NVBench about the sync so it can run the benchmark safely.

If the KernelLauncher does not synchronize but has a very long 
execution time, this may be a false positive. If so, disable this
check with:

    state.set_blocking_kernel_timeout(-1);

The current timeout is set to 30 seconds.

For more information, see the 'Benchmarks that sync' section of the
NVBench documentation.

If this happens while profiling with an external tool,
pass the `--profile` flag to the executable to disable use of blocking kernel
and to also run the benchmark only once.

For more information, see the 'Benchmark Properties' section of the
NVBench documentation.

Fail: Unexpected error: KeyboardInterrupt: 

At:
  /home/opavlyk/repos/nvbench/python/examples/study/with_blocking_krn.py(62): launcher
  /home/opavlyk/repos/nvbench/python/examples/study/with_blocking_krn.py(68): throughput_bench
  /home/opavlyk/repos/nvbench/python/examples/study/with_blocking_krn.py(76): <module>

MESSAGE UNAVAILABLE DUE TO EXCEPTION: KeyboardInterrupt: 
Run:  [2/2] throughput_bench [Device=0 Stride=1 ItemsPerThread=2]
Pass: Cold: 0.846243ms GPU, 0.900781ms CPU, 8.48s total GPU, 9.86s total wall, 10016x 
Pass: Batch: 0.847673ms GPU, 8.49s total GPU, 9.52s total wall, 10017x
```

# Benchmark Results

## throughput_bench

### [0] NVIDIA RTX A6000

| Stride | ItemsPerThread | Elements |  Datasize   | Samples |  CPU Time  | Noise |  GPU Time  | Noise | Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU  |
|--------|----------------|----------|-------------|---------|------------|-------|------------|-------|---------|--------------|--------|---------|------------|
|      1 |              2 | 33554432 | 128.000 MiB |  10016x | 900.781 us | 6.71% | 846.243 us | 0.64% | 39.651G | 317.209 GB/s | 41.30% |  10017x | 847.673 us |
(nvbench) opavlyk@ee09c48-lcedt:~/repos/nvbench/python/examples/study$

Dec 05 '25 15:12 oleksandr-pavlyk