[FEA]: Add improved latency test for cuda.bindings benchmarks, add C++ comparison
### Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
### Area
cuda.bindings
### Is your feature request related to a problem? Please describe.
With the new release of cuda.bench using nvbench as a backing, we want to utilize the statistical models it employs to more accurately benchmark the runtime/latency/overhead of our binding calls specifically. Adding C++ comparisons will allow quick comparison of cuda.bindings' overall performance.
### Describe the solution you'd like
A /benchmarks/ folder that has identical benchmarks of CUDA APIs in Python (through the bindings) as well as the raw C++ functions, with NVBench wrapping both of these to generate JSON files that can be compared to view latency differences.
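A minimal sketch of what the comparable JSON files could look like. The `write_results` helper and every field name below are assumptions for illustration, not a settled schema:

```python
import json
import os
import statistics
import tempfile

# Hypothetical result schema shared by the Python and C++ harnesses, so the
# two JSON files can be diffed directly. All field names are assumptions.
def write_results(path, language, api_name, times_ms):
    record = {
        "language": language,            # "python" or "cpp"
        "api": api_name,                 # e.g. "cudaGetDeviceCount"
        "samples": len(times_ms),
        "median_ms": statistics.median(times_ms),
        "stdev_ms": statistics.stdev(times_ms),
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

# One file per language; an external script would then compare the two.
path = os.path.join(tempfile.gettempdir(), "py_results.json")
record = write_results(path, "python", "cudaGetDeviceCount",
                       [0.010, 0.012, 0.011, 0.013])
```

The C++ side would emit the same schema, making the comparison a plain field-by-field diff.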
### Describe alternatives you've considered
The current implementation uses pytest, which does not offer the same granularity as nvbench.
### Additional context
No response
> The current implementation uses pytest, which does not offer the same granularity as nvbench.
What does this mean?
Also, nvbench is tailored for the cases where one has custom CUDA kernels that need benchmarking. cuda.bindings has none. How exactly would nvbench help instead of extending what we cover with pytest-benchmark?
I think the idea is to use nvbench just to get the extra metrics it provides: it reports some host and device latencies, and provides a Python and C++ framework to compare those a bit more apples-to-apples. @adii-n can confirm.
If you think it's too much complexity just for the bindings benchmarks, that's OK; we can just wrap some of the pytest code and make it write comparable JSON files in C++ and Python so we can compare.
I need to find more time to understand the /why/, but last week I was having a lot of trouble getting reproducible results from pytest-benchmark with reasonable deviation (relative to my traditional tool of choice for host-side benchmarking, pyperf). This is just a note that /whatever/ the right answer here is, we should maybe compare the accuracy and/or effectiveness of the statistical methods of whatever we choose.
That is good info. For the bindings benchmarks we want host-side performance vs. C++, right? Maybe we need to run some quick tests comparing a couple of these options?
I agree, we should throw together some quick prototypes of varying complexity to see which one gets us closest to our goal. @leofang yes, vanilla nvbench is very specialized around kernels, but it does create an access method for the underlying CUPTI API that I was planning to use to measure latency.
I will change the feature description now to be more general about which method is used, and follow up in a bit with some hopefully pretty plots that help us decide.
Here are some initial comparison runs I did, based on `time.perf_counter()` for Python and

```cpp
auto start = std::chrono::high_resolution_clock::now();
// ... code under test ...
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
```

for C++.
Thoughts, @mdboom @danielfrg @leofang? We can probably move forward by modifying the current pytest interface and adding C++ comparisons.
@adii-n interesting, thanks! Could you post the benchmark code that you used?
For example, here is the Python/C++ code used for the context ops measurement:
```python
import time

# `iterations` samples, each averaging over `operations` calls
for _ in range(iterations):
    start = time.perf_counter()
    for _ in range(operations):
        err, device_count = cudaGetDeviceCount()
        err, = cudaDeviceSynchronize()
    end = time.perf_counter()
    times.append((end - start) * 1000 / operations)  # ms per operation
```
and
```cpp
for (int i = 0; i < iterations; ++i) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < operations; ++j) {
        int device_count;
        cudaGetDeviceCount(&device_count);
        cudaDeviceSynchronize();
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    double time_per_op = (duration.count() / 1e6) / operations;  // ms per operation
    times.push_back(time_per_op);
}
```
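To turn the two `times` sequences into a direct Python-vs-C++ comparison, something like the following could work. This is a sketch: the `compare_latencies` helper and the choice of median as the summary statistic are my assumptions, not part of the harness above:

```python
import statistics

def compare_latencies(py_times_ms, cpp_times_ms):
    """Summarize per-op latencies (ms) and report Python overhead vs C++.

    Median is used rather than mean to reduce sensitivity to outlier samples.
    """
    py_med = statistics.median(py_times_ms)
    cpp_med = statistics.median(cpp_times_ms)
    return {
        "python_median_ms": py_med,
        "cpp_median_ms": cpp_med,
        "overhead_ratio": py_med / cpp_med,  # > 1 means Python is slower
    }

# Illustrative numbers only, not measured results
summary = compare_latencies([0.020, 0.022, 0.021], [0.010, 0.011, 0.009])
```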
The main reason I proposed trying nvbench in the first place was that I was worried that using native Python/C++ measurement functions (such as the ones used above in the graphs I posted) would introduce latency differences intrinsic to the language itself, which might overshadow the differences of the CUDA functions themselves. However, you can see that for some benchmarks (e.g. large-size cudaMemcpy) they converge to be quite similar, so we can confidently use these methods.