Windows TensorRT inference time fluctuates greatly with some GPU drivers
Environment
- TensorRT Version: 8.4.0.6 or 8.4.1.5
- NVIDIA GPU: T600 or RTX 3060
- NVIDIA Driver Version: 511.65 or 512.96
- CUDA Version: 11.6 or 11.3
- CUDNN Version: 8.x
- Operating System: Win10 or Win11
I have previously reported the same issue: https://github.com/NVIDIA/TensorRT/issues/1977#issue-1233391613
First, on the development computer, the 511.65 driver is pre-installed. TRT 8.4.0, TRT 8.4.1, CUDA 11.3, and CUDA 11.6 have all been tried. I have used both of the timing methods mentioned here: https://github.com/NVIDIA/TensorRT/issues/1977#issuecomment-1128539745. Inference times fluctuate wildly, as seen here: https://github.com/NVIDIA/TensorRT/issues/1977#issue-1233391613.
Second, on the development computer, I downgraded the graphics driver to 473.47. The inference time then became very stable.
Third, on the new computer (Win11, RTX 3060) where the product is deployed, the 512.96 driver is pre-installed. Inference times still fluctuate wildly.
What is the reason for this? Is the new driver less stable? After the product is released, the user's graphics driver version cannot be controlled, so how can this be solved?
@nvpohanh ^ ^
Could you share the Nsight Systems profiles, with the "--gpu-metrics-device all" flag, of both the stable run and the unstable run? I suspect a driver issue in r495.07+ which might have caused this. (internal tracker id 3634385)
@nvpohanh Sorry, could you please provide more detailed usage instructions?
Could you follow the instructions in the "Run Nsight Systems with trtexec" section in https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#nvprof and share the output "foo_profile" file? Please add "--gpu-metrics-device all" to the nsys command on top of the existing one.
@nvpohanh profile.zip
@nvpohanh Do you have a conclusion yet? My project is on a tight schedule. If the driver version is greater than r495.07, there are huge fluctuations in inference time. The newer graphics cards on Windows 11 (such as the RTX 3060) have no compatible older driver, so how can this be fixed? Does NVIDIA have plans to solve this issue in a new driver? Thanks.
I checked the nsys-rep files, but this does not seem to be caused by the driver bug I mentioned.
Could you try adding "--useCudaGraph --noDataTransfers --useSpinWait" flags and see if that makes runtime more stable?
@nvpohanh I generate TRT engines at runtime, that is, I use C++ code to do model conversion and inference. How do I set these flags in that case?
Oh, I see. So you were not using trtexec. Could you share the code you use to measure latency?
Here are a few things I would try:
- Lock GPU clock: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#gpu-clock-lock-float
- Disable H2D/D2H copies: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#h2d-d2h-data-trans-pci-band
- Try TCC driver mode (only available on Quadro/Tesla GPUs): https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#tcc-mode-wddm-mode
- Use CUDA graphs: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#enqueue-bound-workload
- Use spin-wait for cudaStreamSynchronize() or cudaEventSynchronize(): https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#synch-modes
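As a sketch of the last item, spin-wait synchronization can be enabled process-wide with a device flag. This is an illustrative snippet, assuming it runs before any CUDA context has been created in the process:

```cpp
#include <cuda_runtime.h>

// Sketch: make cudaStreamSynchronize()/cudaEventSynchronize() spin-wait
// on the CPU instead of yielding or blocking, which can reduce wake-up
// latency jitter at the cost of higher CPU usage.
// Must be called before the CUDA context is created, i.e. before the
// first CUDA runtime call that touches the device.
void enableSpinWaitSync() {
    cudaSetDeviceFlags(cudaDeviceScheduleSpin);
}
```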
@nvpohanh
My code for measuring inference time is as follows:
```cpp
auto time_2 = std::chrono::high_resolution_clock::now();
m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr);
cudaStreamSynchronize(m_stream);
auto time_3 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> inferenceTime = (time_3 - time_2);
```
Under the 473.47 driver with T600 (Win10), the inference time is about 5-6 ms, and continuous inference is very stable.
Under the 511.65 driver with T600 (Win10):
Using the aforementioned timing method, the inference times were saved into the following file:
time.txt
Using cudaEvent_t: it takes about the same time as above.
```cpp
cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);
cudaEventRecord(start, m_stream);
m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr);
cudaStreamSynchronize(m_stream);
cudaEventRecord(end, m_stream);
cudaEventSynchronize(end);
float totalTime;
cudaEventElapsedTime(&totalTime, start, end);
```
cudaEvent_t.txt
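One detail worth noting about the event-based timing above: `end` is recorded after cudaStreamSynchronize(), so the measured interval also includes the host-side wait. A variant that records `end` on the stream before synchronizing measures only the GPU work. This is a sketch, reusing the `m_context`/`m_stream`/`m_buffers` members from the snippets above:

```cpp
#include <cuda_runtime.h>

// Sketch: measure GPU execution time with CUDA events, keeping the
// host-side wait out of the measurement by recording `end` on the
// stream immediately after the enqueue.
cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);

cudaEventRecord(start, m_stream);
m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr);
cudaEventRecord(end, m_stream);   // record before any host synchronization
cudaEventSynchronize(end);        // block only until the GPU passes `end`

float gpuMs = 0.0f;
cudaEventElapsedTime(&gpuMs, start, end);

cudaEventDestroy(start);
cudaEventDestroy(end);
```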
Under the 511.65 driver with T600 (Win10), I made the following changes according to the links you gave:
- Added CUDA Graphs as follows:

  ```cpp
  cudaGraph_t graph;
  cudaGraphExec_t instance;
  cudaStreamBeginCapture(m_stream, cudaStreamCaptureModeGlobal);
  m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr);
  cudaStreamEndCapture(m_stream, &graph);
  cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
  cudaGraphLaunch(instance, m_stream);
  cudaStreamSynchronize(m_stream);
  ```

  The inference times were saved into the following file; at the beginning there are still large fluctuations, but the time settles at 3-4 ms after a while. add_cuda_graph.txt
- Preallocated the input and output memory with cudaMallocHost. The inference time showed no significant change.
- Added cudaSetDeviceFlags(cudaDeviceScheduleSpin). The inference time showed no significant change.
In addition, I tested the same model with trtexec under the 511.65 driver with T600 (Win10):
- .\trtexec.exe --loadEngine=test.engine --warmUp=0 --duration=0 --iterations=50
  The inference time fluctuates greatly.
- .\trtexec.exe --loadEngine=test.engine --warmUp=0 --duration=0 --iterations=50 --useCudaGraph --noDataTransfers --useSpinWait
  Repeated several times; the results are relatively stable.
But under the 473.47 driver with T600 (Win10), the results of the above two experiments appear to be the same.
Conclusions and Questions
- The large fluctuations in inference time have been reproduced on the T600, GeForce GTX 1060, and RTX 3060. For the T600 and 1060 on Win10, downgrading the driver version solved the problem. However, for the RTX 3060 under Win11, I could not find a compatible driver version below 500.
- From the above experiments, it seems that using --useCudaGraph --noDataTransfers --useSpinWait gives better results under trtexec, but the results under the 473.47 and 511.65 drivers are almost the same. Does that mean the problem has nothing to do with the driver version?
- From the above experiments, with my runtime inference code, using the equivalents of --useCudaGraph --noDataTransfers --useSpinWait did not give better inference times. Am I using them the wrong way?
- From the above experiments, with my runtime inference code, downgrading the driver version does solve the problem of large fluctuations in inference time. But isn't that inconsistent with the second point?
- Has no one reported similar problems before?
- Under 473.47, the inference time is about 5-6 ms. Under 511.65 with CUDA Graphs added, although the inference time fluctuated greatly at the beginning, it was later maintained at 2-3 ms. Does this mean that using CUDA Graphs speeds up inference?
@nvpohanh
Additional test:
My earlier finding that adding CUDA Graphs to my runtime code made the inference time stabilize at 3-4 ms after a while turns out to be invalid, because I did not release the graph and instance, which causes a GPU memory leak.
After I added the following code:
```cpp
cudaGraphDestroy(graph);
cudaGraphExecDestroy(instance);
```
There are no GPU memory leaks anymore, but the inference time still fluctuates wildly and no longer stabilizes at 3-4 ms after running for a while.
From the above experiments, it seems that using --useCudaGraph --noDataTransfers --useSpinWait gives better results under trtexec, but the results under the 473.47 and 511.65 drivers are almost the same. Does that mean the problem has nothing to do with the driver version?
Is it possible to find out which of the three flags solves the instability issue? For example, does adding "--useSpinWait" without the other two give stable result?
@nvpohanh Only --useCudaGraph gives better results. As with the other issues I mentioned earlier, how do I solve this problem in my runtime code?
I see. Could you add a dummy enqueueV2() call before capturing the graph? Also, the graph capture only needs to be done once at application startup. Then you only need to call cudaGraphLaunch(instance, m_stream) and cudaStreamSynchronize(m_stream) for each actual inference.
```cpp
cudaGraph_t graph;
cudaGraphExec_t instance;
m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr); // a dummy run before capturing
cudaStreamBeginCapture(m_stream, cudaStreamCaptureModeGlobal);
m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr);
cudaStreamEndCapture(m_stream, &graph);
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, m_stream);
cudaStreamSynchronize(m_stream);
```
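To make the lifetime explicit, here is a hedged sketch of the whole pattern: capture once at startup, relaunch per inference, and destroy both graph objects at shutdown to avoid the memory leak mentioned earlier. The member names are carried over from the snippets above, and `numInferences` is a placeholder for your own inference loop:

```cpp
#include <cuda_runtime.h>

cudaGraph_t graph = nullptr;
cudaGraphExec_t instance = nullptr;

// Warm-up run outside capture (lazy initialization happens here).
m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr);
cudaStreamSynchronize(m_stream);

// Capture once at application startup.
cudaStreamBeginCapture(m_stream, cudaStreamCaptureModeGlobal);
m_context->enqueueV2(m_buffers->getDeviceBindings().data(), m_stream, nullptr);
cudaStreamEndCapture(m_stream, &graph);
cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);

// Per-inference hot loop: launch + sync only, no re-capture.
for (int i = 0; i < numInferences; ++i) {
    // (Update input buffers here before each launch.)
    cudaGraphLaunch(instance, m_stream);
    cudaStreamSynchronize(m_stream);
}

// Cleanup once at shutdown; prevents the GPU memory leak.
cudaGraphExecDestroy(instance);
cudaGraphDestroy(graph);
```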
@nvpohanh Under 473.47, the inference time doubles. Under 511.65, inference times still fluctuate wildly.
@nvpohanh If there are any new developments or conclusions, please share them. Thanks.
I have the same problem. When the driver is 511.79, the TRT inference time fluctuates too much; when I downgrade the driver to 472.12, the inference time is stable. Is there a conclusion now?