
Failure of TensorRT 10.7 when running inference on A4500

Open · Alex18947 opened this issue 8 months ago · 4 comments

Description

We run inference with our model successfully, but we observe stability issues. After hours or days of runtime, IExecutionContext::enqueueV2/enqueueV3 suddenly starts returning false and never recovers. This is followed by a driver crash visible in the Windows event log:

Display Driver nvlddmkm Stopped Responding.

The problem is that the documentation does not mention what this actually means, or how such a situation (the false return value) should be handled.

The errors happen randomly, anywhere from 2 to 200 hours into the inference loop.

We are using TensorRT directly from C++ code.

What we've checked so far:

  • There is no GPU resource leak whatsoever: all resources (GPU RAM, streams, etc.) are allocated before entering the inference loop, and regular resource checks are performed (see the sketch after this list).
  • Input tensor dimensions are set before entering the inference loop, so no dimension mismatch can occur.
  • We used CUDA Compute Sanitizer to confirm the inference loop is healthy.
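For reference, the loop follows the usual TensorRT 10 C++ pattern, roughly as sketched below; tensor names, shapes, and buffer handling are simplified placeholders, not our actual code:

    // Simplified sketch of the loop structure (TensorRT 10 C++ API).
    // Tensor names, shapes and buffers are placeholders, not our real model.
    #include <NvInfer.h>
    #include <cuda_runtime_api.h>
    #include <cstdio>

    bool runInferenceLoop(nvinfer1::IExecutionContext& context, cudaStream_t stream,
                          void* dInput, void* dOutput)
    {
        // Shapes and tensor addresses are configured exactly once, before the loop.
        if (!context.setInputShape("input", nvinfer1::Dims4{1, 3, 224, 224}))
            return false;
        context.setTensorAddress("input", dInput);
        context.setTensorAddress("output", dOutput);

        for (;;)
        {
            // ... copy the next input batch into dInput (omitted) ...

            if (!context.enqueueV3(stream))
            {
                // This is the failure we see: after days of runtime enqueueV3
                // starts returning false and never succeeds again.
                std::fprintf(stderr, "enqueueV3 failed: %s\n",
                             cudaGetErrorString(cudaGetLastError()));
                return false;
            }
            if (cudaStreamSynchronize(stream) != cudaSuccess)
                return false;

            // ... copy results out of dOutput and hand them off (omitted) ...
        }
    }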

Environment

TensorRT Version: 10.7

NVIDIA GPU: A4500 Ada (reproduced on Ampere as well).

NVIDIA Driver Version: Latest as of today (April 4, 2025)

CUDA Version: 12.6

CUDNN Version: 9.6

Operating System: Windows Server 2022

Notable observations

When running on a server with 2x A4500 Ada, both inference loops stop working at exactly the same time, even though each GPU has its own loop with dedicated resources and is fed different data. Both end up with broken inference engine instances, which, in my eyes, points at the driver.

Are there any ways or strategies to narrow down the possible causes of these problems? Nothing looks suspicious until the inference engine starts returning false from its execution methods and the driver crashes.

Thanks.

Alex18947 · Apr 04 '25 13:04

Sounds like it's caused by a driver crash. Is the driver up to date, and does rebooting help?

yuanyao-nv · Apr 22 '25 20:04

Yes, we update the drivers regularly. We first saw this problem with TensorRT 8; now we are using 10.7. The problem is the random and rare occurrence, and that there is almost no information we are able to collect when the failure happens. I don't expect anybody to tell me the cause here; instead, I am trying to find ways, if any, to collect debug data or other relevant information so I can file a better report.

We've recently added an IErrorRecorder implementation to the execution context to see if we can get more info.
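In essence it is a small thread-safe recorder, roughly like the sketch below (simplified; class and member names here are illustrative, similar in spirit to the ErrorRecorder used in the TensorRT samples):

    // Minimal thread-safe IErrorRecorder sketch (simplified, illustrative names).
    #include <NvInfer.h>
    #include <atomic>
    #include <cstdint>
    #include <mutex>
    #include <string>
    #include <utility>
    #include <vector>

    class SimpleErrorRecorder : public nvinfer1::IErrorRecorder
    {
    public:
        int32_t getNbErrors() const noexcept override
        {
            std::lock_guard<std::mutex> lock(mMutex);
            return static_cast<int32_t>(mErrors.size());
        }
        nvinfer1::ErrorCode getErrorCode(int32_t index) const noexcept override
        {
            std::lock_guard<std::mutex> lock(mMutex);
            if (index < 0 || index >= static_cast<int32_t>(mErrors.size()))
                return nvinfer1::ErrorCode::kINVALID_ARGUMENT;
            return mErrors[index].first;
        }
        ErrorDesc getErrorDesc(int32_t index) const noexcept override
        {
            std::lock_guard<std::mutex> lock(mMutex);
            if (index < 0 || index >= static_cast<int32_t>(mErrors.size()))
                return "";
            return mErrors[index].second.c_str();
        }
        bool hasOverflowed() const noexcept override { return false; }
        void clear() noexcept override
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mErrors.clear();
        }
        bool reportError(nvinfer1::ErrorCode code, ErrorDesc desc) noexcept override
        {
            try
            {
                std::lock_guard<std::mutex> lock(mMutex);
                mErrors.emplace_back(code, std::string(desc));
            }
            catch (...)
            {
                // Never let an exception escape a noexcept callback.
            }
            return true; // treat every recorded error as fatal
        }
        RefCount incRefCount() noexcept override { return ++mRefCount; }
        RefCount decRefCount() noexcept override { return --mRefCount; }

    private:
        mutable std::mutex mMutex;
        std::vector<std::pair<nvinfer1::ErrorCode, std::string>> mErrors;
        std::atomic<int32_t> mRefCount{0};
    };

It is attached once to the execution context via setErrorRecorder() right after the context is created.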

No reboot is necessary; the driver "restores" itself and we just restart our process.

Alex18947 · Apr 24 '25 07:04

OK, we were finally able to get some information from the IErrorRecorder interface. Since it can take days to weeks for the execution context to fail, this took some time. The error message reported by the execution context is:

IExecutionContext::enqueueV3: Error Code 1: CuTensor (Internal cuTensor permutate execute failed)

This is, as always, followed by the "nvlddmkm stopped responding" message in the Windows event log.
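For completeness, this is roughly how the message is pulled out of the recorder once the call fails (a fragment continuing the loop sketch from the first comment):

    // Simplified: dump whatever the attached recorder has collected
    // once enqueueV3 reports failure.
    if (!context.enqueueV3(stream))
    {
        nvinfer1::IErrorRecorder* rec = context.getErrorRecorder();
        for (int32_t i = 0; rec != nullptr && i < rec->getNbErrors(); ++i)
        {
            std::fprintf(stderr, "TRT error %d: %s\n",
                         static_cast<int32_t>(rec->getErrorCode(i)),
                         rec->getErrorDesc(i));
        }
    }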

Does it help in any way?

Alex18947 · Jun 02 '25 09:06

I see there's an issue with a similar error message here: https://github.com/NVIDIA/TensorRT/issues/3308

Could you try the trtexec execution path to see if it's reproducible there?
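For example, something along these lines (the engine path and duration below are placeholders) would exercise just the engine in a long-running loop:

    trtexec --loadEngine=model.engine --duration=86400 --verbose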

yuanyao-nv · Jun 17 '25 23:06

No, at least not easily. Since I cannot experiment on the customer servers, I would basically need to build a new server, configure the high-load scenario, and wait days or weeks until I eventually see the driver crash. What we continue to do is update CUDA, TensorRT, and the drivers in the hope that one day this problem will go away. There do not seem to be any other good ways to even report this instability issue, given its random occurrence and poor reproducibility...

Alex18947 · Jun 30 '25 08:06

Well, the error persists with CUDA 12.9 and TRT 10.10. Is there really no strategy to narrow this down?

Alex18947 · Jul 09 '25 13:07