nvbench
CUDA Kernel Benchmarking Library
The current implementation computes the throughput statistics in `measure_cold`, which is invoked during `state.exec`. This has the undesirable effect that throughput statistics are not generated when reads/writes are declared after...
Now the option parser throws an exception if any parameters don't match the corresponding stopping criteria. This PR addresses issue #153.
This PR replaces the `VAULT_HOST` variable with `AWS_ROLE_ARN`, which is required to use the new token service to get AWS credentials.
nvbench is a great tool for generating profiles for libcudf. I've found that the `--profile` option with `--run-once` was a good starting point, but for many operations we need more...
I was trying to compare benchmark results for A100 PCI and A100 SXM, but nvbench refused with:
```
nvbench_compare.py ./babelstream_fallback_blocks_A100_PCI/ ./babelstream_fallback_blocks_A100_SXM/
['./babelstream_fallback_blocks_A100_PCI/', './babelstream_fallback_blocks_A100_SXM/']
Device sections do not match.
```
I...
CUDA events suffer from low accuracy and include the kernel launch overhead. CUPTI, on the other hand, provides a more reliable way to get consistent timing measurements. This request asks...
I have run into this twice now and thought it would be great if an nvbench-based benchmark could create any intermediate directories for the output JSON file. Now, with the...
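The behavior requested above can be sketched with the C++17 `std::filesystem` library. This is only an illustration under stated assumptions: `open_output_with_parents` is a hypothetical helper name, not an nvbench function, and the excerpt does not say how nvbench itself opens its JSON output.

```cpp
#include <filesystem>
#include <fstream>

// Hypothetical helper (not part of nvbench): open `file` for writing,
// creating any missing parent directories first.
inline std::ofstream open_output_with_parents(const std::filesystem::path &file)
{
  const std::filesystem::path parent = file.parent_path();
  if (!parent.empty())
  {
    // Does nothing if the directories already exist;
    // throws std::filesystem::filesystem_error on failure.
    std::filesystem::create_directories(parent);
  }
  return std::ofstream(file);
}
```

Calling `open_output_with_parents("results/a100/run.json")` would create `results/a100/` if needed before opening the file. A bare file name has an empty parent path, so no directory is created in that case.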
I attempted to build nvbench with the following setup and was faced with compiler errors.
- nvbench commit: a171514056e5d6a7f52a035dd6c812fa301d4f4f (latest commit to main)
- nvcc: Cuda compilation tools, release 11.5,...
I followed the instructions in the README to build the examples (`cmake -DNVBench_ENABLE_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES=80 .. && make`), but when I run the examples with `./nvbench.example.cpp20.axes` or `./nvbench.example.cpp17.axes`, I get the error...
nvbench currently allows measuring SOL for memory-bound workloads by providing
```
state.addGlobalMemoryReads(nbytes)
state.addGlobalMemoryWrites(nbytes)
```
It would be useful to extend this concept to provide flops such that we...
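As a rough sketch of the arithmetic such a feature would involve, the snippet below turns a declared floating-point operation count and a measured time into a rate and a fraction of peak. Everything here is an assumption for illustration: `flop_summary`, `summarize_flops`, and the caller-supplied peak rate are hypothetical names and inputs, and none of this is existing nvbench API.

```cpp
#include <cstdint>

// Hypothetical result type (not part of nvbench).
struct flop_summary
{
  double flops_per_second; // achieved floating-point rate
  double fraction_of_peak; // achieved rate divided by the peak rate
};

// Hypothetical helper: derive the achieved FLOP rate from a declared
// operation count and a measured time in seconds. The peak rate must be
// supplied by the caller; it is not queried from the device here.
inline flop_summary summarize_flops(std::int64_t flop_count,
                                    double seconds,
                                    double peak_flops_per_second)
{
  if (seconds <= 0.0 || peak_flops_per_second <= 0.0)
  {
    return {0.0, 0.0}; // no meaningful rate can be computed
  }
  const double rate = static_cast<double>(flop_count) / seconds;
  return {rate, rate / peak_flops_per_second};
}
```

For example, 2,000,000,000 operations measured at 0.5 s against a peak of 8e9 FLOP/s give a rate of 4e9 FLOP/s, or half of peak.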