[FEA]: Add improved latency test for cuda.bindings benchmarks, add C++ comparison
### Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
### Area
cuda.bindings
### Is your feature request related to a problem? Please describe.
With the new release of cuda.bench using nvbench as a backing, we want to utilize the statistical models it employs to more accurately benchmark the runtime/latency/overhead of our binding calls specifically. Adding C++ comparisons will allow quick comparison of cuda.bindings' overall performance.
### Describe the solution you'd like
A /benchmarks/ folder that has identical benchmarks of CUDA APIs in Python (through the bindings) as well as the raw C++ functions, with NVBench wrapping both of these to generate JSON files that can be compared to view latency differences.
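A minimal sketch of what the comparable JSON files could look like. The `write_results` helper and every field name below are assumptions for illustration, not a settled schema:

```python
import json
import os
import statistics
import tempfile

# Hypothetical result schema shared by the Python and C++ harnesses, so the
# two JSON files can be diffed directly. All field names are assumptions.
def write_results(path, language, api_name, times_ms):
    record = {
        "language": language,            # "python" or "cpp"
        "api": api_name,                 # e.g. "cudaGetDeviceCount"
        "samples": len(times_ms),
        "median_ms": statistics.median(times_ms),
        "stdev_ms": statistics.stdev(times_ms),
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

# One file per language; an external script would then compare the two.
path = os.path.join(tempfile.gettempdir(), "py_results.json")
record = write_results(path, "python", "cudaGetDeviceCount",
                       [0.010, 0.012, 0.011, 0.013])
```

The C++ side would emit the same schema, making the comparison a plain field-by-field diff.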
### Describe alternatives you've considered
The current implementation uses pytest, which does not offer the same granularity as nvbench.
### Additional context
No response
> The current implementation uses pytest, which does not offer the same granularity as nvbench.
What does this mean?
Also, nvbench is tailored for the cases where one has custom CUDA kernels that need benchmarking. cuda.bindings has none. How exactly would nvbench help instead of extending what we cover with pytest-benchmark?
I think the idea is to use nvbench just to get the extra metrics it provides: it reports some host and device latencies, and provides a Python and C++ framework to compare those a bit more apples-to-apples. @adii-n can confirm.
If you think it's too much complexity just for the bindings benchmarks, that's OK; we can just wrap some of the pytest code and make it write comparable JSON files in C++ and Python so we can compare.
I need to find more time to understand the /why/, but last week I was having a lot of trouble getting reproducible results from pytest-benchmark with reasonable deviation (relative to my traditional tool of choice for host-side benchmarking, pyperf). This is just a note that /whatever/ the right answer here is, we should maybe compare the accuracy and/or effectiveness of the statistical methods of whatever we choose.
That is good info. For the bindings benchmarks we want host-side performance vs. C++, right? Maybe we need to run some quick tests comparing a couple of these options?
I agree, we should throw together some quick prototypes of varying complexity to see which one gets us closest to our goal. @leofang yes, vanilla nvbench is very specialized around kernels, but it does create an access method for the underlying CUPTI API that I was planning to use to measure latency.
I will change the feature description now to be more general about which method is used, and follow up in a bit with some hopefully pretty plots that help us decide.
Here are some initial comparison runs I did, based on `time.perf_counter()` for Python and

```cpp
auto start = std::chrono::high_resolution_clock::now();
// ... code under test ...
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
```

for C++.
Thoughts, @mdboom @danielfrg @leofang? We can probably move forward by modifying the current pytest interface and adding C++ comparisons.
@adii-n interesting, thanks! Could you post the benchmark code that you used?
For example, here is the Python/C++ code used for the context ops measurement:
```python
import time

# `iterations` samples, each averaging over `operations` calls
for _ in range(iterations):
    start = time.perf_counter()
    for _ in range(operations):
        err, device_count = cudaGetDeviceCount()
        err, = cudaDeviceSynchronize()
    end = time.perf_counter()
    times.append((end - start) * 1000 / operations)  # ms per operation
```
and
```cpp
for (int i = 0; i < iterations; ++i) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < operations; ++j) {
        int device_count;
        cudaGetDeviceCount(&device_count);
        cudaDeviceSynchronize();
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    double time_per_op = (duration.count() / 1e6) / operations;  // ms per operation
    times.push_back(time_per_op);
}
```
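To turn the two `times` sequences into a direct Python-vs-C++ comparison, something like the following could work. This is a sketch: the `compare_latencies` helper and the choice of median as the summary statistic are my assumptions, not part of the harness above:

```python
import statistics

def compare_latencies(py_times_ms, cpp_times_ms):
    """Summarize per-op latencies (ms) and report Python overhead vs C++.

    Median is used rather than mean to reduce sensitivity to outlier samples.
    """
    py_med = statistics.median(py_times_ms)
    cpp_med = statistics.median(cpp_times_ms)
    return {
        "python_median_ms": py_med,
        "cpp_median_ms": cpp_med,
        "overhead_ratio": py_med / cpp_med,  # > 1 means Python is slower
    }

# Illustrative numbers only, not measured results
summary = compare_latencies([0.020, 0.022, 0.021], [0.010, 0.011, 0.009])
```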
The main reason I proposed trying nvbench in the first place was that I was worried that using native Python/C++ measurement functions (such as the ones used above in the graphs I posted) would introduce latency differences intrinsic to the language itself, which might overshadow the differences of the CUDA functions themselves. However, you can see that for some benchmarks (e.g. large-size cudaMemcpy) they converge to be quite similar, so we can confidently use these methods.