
Throughput Failed On Multiple GPUs

Open westfly opened this issue 3 months ago • 2 comments

I ran the example throughput.cu and it failed on a machine with 4 GPUs:


Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [5/8] throughput_bench [Device=0]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [6/8] throughput_bench [Device=1]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [7/8] throughput_bench [Device=2]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run:  [8/8] throughput_bench [Device=3]
Pass: Cold: 0.007061ms GPU, 0.016156ms CPU, 0.50s total GPU, 6.81s total wall, 70816x
Pass: Batch: 0.002299ms GPU, 0.50s total GPU, 0.50s to
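
Note the failure pattern: devices 0-2 fail in `cudaMemsetAsync` on the L2-flush buffer, while the last device passes. A plausible cause (an assumption, not confirmed from the nvbench source) is that the L2-flush buffer is allocated while one device is current and then memset from a stream on a different device; without peer access enabled between the GPUs, that returns cudaErrorInvalidValue. A minimal standalone sketch of that situation:

```cuda
// Hypothetical repro sketch (not nvbench code): allocate a buffer while one
// device is current, then issue cudaMemsetAsync on it from another device's
// stream. On systems without peer access between the GPUs this is expected
// to fail with cudaErrorInvalidValue, matching the log above.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  int count = 0;
  cudaGetDeviceCount(&count);
  if (count < 2)
  {
    std::printf("Needs at least 2 GPUs.\n");
    return 0;
  }

  cudaSetDevice(1);                 // buffer lives on device 1
  void *buf = nullptr;
  cudaMalloc(&buf, 1 << 20);

  cudaSetDevice(0);                 // now issue work from device 0
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Memset a device-1 allocation from a device-0 stream:
  cudaError_t err = cudaMemsetAsync(buf, 0, 1 << 20, stream);
  std::printf("cudaMemsetAsync: %s\n", cudaGetErrorName(err));

  cudaStreamDestroy(stream);
  cudaSetDevice(1);
  cudaFree(buf);
  return 0;
}
```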

I noticed that examples/stream.cu calls set_cuda_stream:

  state.set_cuda_stream(nvbench::make_cuda_stream_view(default_stream));

so I added it to throughput.cu, and with that change it works fine.
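
For reference, a sketch of where the workaround might sit in the benchmark (the function body here is assumed, only the `set_cuda_stream` call is taken from examples/stream.cu; the stream is created after nvbench has made the benchmark's device current, so all measured work targets the active GPU):

```cuda
// Hypothetical sketch of the workaround applied to throughput_bench.
#include <nvbench/nvbench.cuh>

void throughput_bench(nvbench::state &state)
{
  // Create a stream on the device nvbench selected for this run, and tell
  // nvbench to measure on it instead of its internally managed stream:
  cudaStream_t default_stream;
  cudaStreamCreate(&default_stream);
  state.set_cuda_stream(nvbench::make_cuda_stream_view(default_stream));

  // ... allocate buffers, state.add_element_count(...), state.exec(...) ...
}
NVBENCH_BENCH(throughput_bench);
```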

# Log



Run:  [1/4] throughput_bench [Device=0]
Pass: Cold: 0.663276ms GPU, 0.672594ms CPU, 0.51s total GPU, 0.54s total wall, 768x
Pass: Batch: 0.659212ms GPU, 0.53s total GPU, 0.53s total wall, 800x
Run:  [2/4] throughput_bench [Device=1]
Pass: Cold: 0.665058ms GPU, 0.674441ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660540ms GPU, 0.54s total GPU, 0.54s total wall, 815x
Run:  [3/4] throughput_bench [Device=2]
Pass: Cold: 0.664827ms GPU, 0.674139ms CPU, 0.51s total GPU, 0.55s total wall, 768x
Pass: Batch: 0.660413ms GPU, 0.53s total GPU, 0.53s total wall, 809x
Run:  [4/4] throughput_bench [Device=3]
Pass: Cold: 0.665416ms GPU, 0.674786ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660745ms GPU, 0.53s total GPU, 0.53s total wall, 807x

westfly avatar Nov 23 '25 13:11 westfly

For what it is worth, the example throughput.cu works out of the box on my desktop with two GPUs in it (an RTX A6000 and an RTX A400), with driver 575.57.08 and CTK 12.9.

I also ran the example on a machine with two Tesla V100 cards, running driver 580.95.05 and CTK 13.0, and the example likewise ran fine out of the box.

Could you share more specifics of your setup to help us reproduce the issue?

oleksandr-pavlyk avatar Dec 08 '25 14:12 oleksandr-pavlyk

Thanks for your reply. I use CUDA 12.6 on Ubuntu:

cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

nvidia-smi shows the following:

nvidia-smi
Wed Dec 10 11:14:43 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1B.0 Off |                    0 |
| N/A   34C    P0              35W /  70W |    972MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:1C.0 Off |                    0 |
| N/A   26C    P8              15W /  70W |      4MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       On  | 00000000:00:1D.0 Off |                    0 |
| N/A   24C    P8              15W /  70W |      4MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   24C    P8              15W /  70W |      4MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2107      C   /usr/local/bin/python3                      204MiB |
+---------------------------------------------------------------------------------------+

westfly avatar Dec 10 '25 11:12 westfly