nvbench
Throughput example fails on multiple GPUs
I ran the example throughput.cu and it failed on a machine with 4 GPUs:
```
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run: [5/8] throughput_bench [Device=0]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run: [6/8] throughput_bench [Device=1]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run: [7/8] throughput_bench [Device=2]
Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument
Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size), stream)'
Run: [8/8] throughput_bench [Device=3]
Pass: Cold: 0.007061ms GPU, 0.016156ms CPU, 0.50s total GPU, 6.81s total wall, 70816x
Pass: Batch: 0.002299ms GPU, 0.50s total GPU, 0.50s to
```
I noticed that examples/stream.cu calls set_cuda_stream:

```
state.set_cuda_stream(nvbench::make_cuda_stream_view(default_stream));
```

so I added the same call to throughput.cu, and it works fine.
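For reference, this is roughly how the workaround looks when applied to a benchmark function. This is a sketch modeled on nvbench's stream.cu example; the kernel, buffer sizes, and names below are placeholders for illustration, not the actual throughput.cu code:

```cpp
#include <nvbench/nvbench.cuh>

// Placeholder kernel standing in for the example's workload.
__global__ void copy_kernel(const int *in, int *out, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = in[i]; }
}

void throughput_bench(nvbench::state &state)
{
  // Workaround: explicitly bind the (non-owning) default stream to the
  // state, as in examples/stream.cu, so that nvbench's internal
  // operations (such as the L2 flush) run on this stream instead of an
  // internally created one.
  cudaStream_t default_stream = 0;
  state.set_cuda_stream(nvbench::make_cuda_stream_view(default_stream));

  const int n = 1 << 20;
  int *in{};
  int *out{};
  cudaMalloc(&in, n * sizeof(int));
  cudaMalloc(&out, n * sizeof(int));

  state.exec([=](nvbench::launch &launch) {
    copy_kernel<<<(n + 255) / 256, 256, 0, launch.get_stream()>>>(in, out, n);
  });
}
NVBENCH_BENCH(throughput_bench);
```

Note that `make_cuda_stream_view` creates a non-owning view, so the caller stays responsible for the stream's lifetime.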
# Log
```
Run: [1/4] throughput_bench [Device=0]
Pass: Cold: 0.663276ms GPU, 0.672594ms CPU, 0.51s total GPU, 0.54s total wall, 768x
Pass: Batch: 0.659212ms GPU, 0.53s total GPU, 0.53s total wall, 800x
Run: [2/4] throughput_bench [Device=1]
Pass: Cold: 0.665058ms GPU, 0.674441ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660540ms GPU, 0.54s total GPU, 0.54s total wall, 815x
Run: [3/4] throughput_bench [Device=2]
Pass: Cold: 0.664827ms GPU, 0.674139ms CPU, 0.51s total GPU, 0.55s total wall, 768x
Pass: Batch: 0.660413ms GPU, 0.53s total GPU, 0.53s total wall, 809x
Run: [4/4] throughput_bench [Device=3]
Pass: Cold: 0.665416ms GPU, 0.674786ms CPU, 0.50s total GPU, 0.53s total wall, 752x
Pass: Batch: 0.660745ms GPU, 0.53s total GPU, 0.53s total wall, 807x
```
For what it is worth, the throughput.cu example works out of the box on my desktop with two GPUs, an RTX A6000 and an RTX A400, running driver 575.57.08 and CTK 12.9.
I also ran the example on a machine with two Tesla V100 cards, running driver 580.95.05 and CTK 13.0 and the example ran fine out of the box.
Could you share more specifics of your setup to help us reproduce the issue?
Thanks for your reply. I am using CUDA 12.6 on Ubuntu.
```
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
```
The `nvidia-smi` output is shown below:
```
$ nvidia-smi
Wed Dec 10 11:14:43 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1B.0 Off |                    0 |
| N/A   34C    P0              35W /  70W |    972MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:1C.0 Off |                    0 |
| N/A   26C    P8              15W /  70W |      4MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       On  | 00000000:00:1D.0 Off |                    0 |
| N/A   24C    P8              15W /  70W |      4MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   24C    P8              15W /  70W |      4MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2107      C   /usr/local/bin/python3                      204MiB |
+---------------------------------------------------------------------------------------+
```