nvbench
CUDA Kernel Benchmarking Library
When I run an nvbench-based benchmark, like: ``` thrust.bench.transform.basic.base -d 0 --stopping-criterion entropy --json baseline.json ``` it prints a nice markdown report at the end, summarizing the run benchmarks, times,...
When NVBench is consumed as a dependency (for example, from CCCL), cmake still checks for a bunch of warning flags: ``` -- Performing Test NVBench_CXX_FLAG__Wall -- Performing Test NVBench_CXX_FLAG__Wall -...
Set the underlying type for `enum class exec_tag` to `::cuda::std::uint16_t`, rather than the 32-bit default integral type. This change reduces the size of an exec_tag instance from 4 bytes to 2 bytes, it...
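The size effect described above can be illustrated with a minimal sketch. It uses `std::uint16_t` from the host standard library in place of `::cuda::std::uint16_t` so that it compiles without CUDA, and the enum names `default_tag` and `compact_tag`, their enumerators, and `check_tag_sizes` are hypothetical stand-ins, not NVBench's actual `exec_tag` definition.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in: no underlying type given, so a scoped enum uses int.
enum class default_tag { none = 0x0, timer = 0x1 };

// Hypothetical stand-in: explicit 16-bit underlying type.
enum class compact_tag : std::uint16_t { none = 0x0, timer = 0x1 };

static_assert(std::is_same<std::underlying_type<default_tag>::type, int>::value,
              "a scoped enum defaults to int");
static_assert(std::is_same<std::underlying_type<compact_tag>::type, std::uint16_t>::value,
              "explicit underlying type is honored");

// Runtime check of the same facts; sizeof(int) is 4 on common platforms.
inline void check_tag_sizes()
{
  assert(sizeof(default_tag) == sizeof(int));
  assert(sizeof(compact_tag) == 2);
}
```

Because the enumerator values stay the same, only the storage width changes; any code that casts the tag to a wider integer keeps working.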
In Python, the kernel generator is a user-defined callable. We need to capture the Python object of that callable in the kernel generator provided for each benchmark. To this end, nvbench::benchmark has been...
Using the Python API to `nvbench`, one could - use Python as a benchmark driver - analyze benchmark data in Python as they are collected - profile kernels authored in Python, such...
NVBench currently supports kernel profiling with external tools via the `--profile` flag. However, profiling tools collect all activities in the benchmark, including "setup/initialization" kernels that are not...
Recently, when benchmarking libcudf on a DGX system, I ran into an issue where the memory resource (MR) set up by libcudf would only be respected by nvbench on GPU0. We observed that...
Introduce a global configuration object that allows process-wide settings, starting with separate decimal (GB) vs base2 (GiB) options for bandwidth and memory size. Expose programmatic interfaces and CLI options to...
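The decimal versus base-2 distinction can be made concrete with a short sketch. The names `unit_base` and `to_giga` are hypothetical and are not part of NVBench's interface; the only facts relied on are that 1 GB is 10^9 bytes and 1 GiB is 2^30 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical selector for the unit convention (not an NVBench type).
enum class unit_base { decimal, binary };

// Convert a byte count to GB (divide by 10^9) or GiB (divide by 2^30).
inline double to_giga(std::uint64_t bytes, unit_base base)
{
  const double divisor = (base == unit_base::decimal) ? 1e9 : 1073741824.0;
  return static_cast<double>(bytes) / divisor;
}
```

For instance, `to_giga(1073741824, unit_base::binary)` yields exactly 1 GiB, while the same byte count reported in decimal units is slightly above 1.07 GB, which is why reports should state which convention they use.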
Based on the output of nvbench, it looks like the global memory bandwidth and bandwidth utilization are calculated based on GPU Time. Below is an excerpt from a recent benchmark....
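As a rough illustration of the calculation this report refers to, the sketch below assumes that bandwidth is the number of bytes moved divided by the measured GPU time, and that utilization is that bandwidth divided by the device's peak bandwidth. Both formulas and the function names `bandwidth_bytes_per_sec` and `bandwidth_utilization` are assumptions made for illustration, not NVBench's implementation.

```cpp
#include <cassert>

// Assumed formula: achieved bandwidth = bytes moved / GPU time in seconds.
inline double bandwidth_bytes_per_sec(double bytes_moved, double gpu_time_sec)
{
  return bytes_moved / gpu_time_sec;
}

// Assumed formula: utilization = achieved bandwidth / peak bandwidth (0..1).
inline double bandwidth_utilization(double achieved_bytes_per_sec,
                                    double peak_bytes_per_sec)
{
  return achieved_bytes_per_sec / peak_bytes_per_sec;
}
```

Under these assumptions, a kernel that moves 2e9 bytes in 0.5 s of GPU time achieves 4e9 bytes/s, which is 50% utilization on a device with a hypothetical peak of 8e9 bytes/s. Substituting a different time measurement in the denominator changes both numbers.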
I ran example [throughput.cu](https://github.com/NVIDIA/nvbench/blob/main/examples/throughput.cu) and it failed on 4XGPU, ```shell Command: 'cudaMemsetAsync(m_l2_buffer, 0, static_cast(m_l2_size), stream)' Run: [5/8] throughput_bench [Device=0] Fail: Unexpected error: /data/github/build/cache/nvbench/b2fc/nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid...