TensorRT
[Feature request] Make using incompatible timing caches for building CUDA engines not a hard error
Description
I am currently researching ways to speed up building TensorRT CUDA engines at runtime from ONNX models using the C++ API. That is, I load an ONNX file, build a CUDA engine from it, and then use that engine for inference. For application-specific reasons it is not possible either to prebuild engines before running the application or to store them on a filesystem for later reuse, so I am looking for ways to improve CUDA engine build speed. I have explored all the available API options, and the timing cache turned out to be the best option for me. Here is a rough outline of the approach I came up with:
- Obtain several timing caches using GPUs of each architecture / compute capability I want my application to support.
- Combine those caches into a single one.
- Provide the combined cache to the builder when building a CUDA engine from the ONNX file.
- Build the CUDA engine quickly, because the provided timing cache already contains the needed profiling measurements (the last two steps are sketched below).
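For reference, here is a rough sketch of the last two steps with the C++ API. File names like model.onnx and combined_timing.cache, the loadFile helper, and the logger are placeholders, and the combined cache is assumed to have been produced offline (e.g. with ITimingCache::combine and ignoreMismatch = true):

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <memory>
#include <vector>

namespace {
// Minimal logger required by the builder / parser entry points.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("[TRT] %s\n", msg);
    }
} gLogger;

// Placeholder helper: read a whole file into memory.
std::vector<char> loadFile(const char* path) {
    std::ifstream in(path, std::ios::binary);
    return {std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()};
}
}  // namespace

int main() {
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger));
    const auto flags =
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(flags));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, gLogger));
    parser->parseFromFile("model.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

    // Load the pre-combined timing cache (built offline from per-GPU caches,
    // merged with ITimingCache::combine) and hand it to the builder.
    std::vector<char> blob = loadFile("combined_timing.cache");
    auto cache = std::unique_ptr<nvinfer1::ITimingCache>(
        config->createTimingCache(blob.data(), blob.size()));
    // ignoreMismatch = true is meant to tolerate entries recorded on a different device.
    config->setTimingCache(*cache, /*ignoreMismatch=*/true);

    // This is the step that fails in some environments with the error shown below.
    auto plan = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    return plan ? 0 : 1;
}
```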
Unfortunately, the last step fails in some specific environments with the following error message:
[TensorRT] 2: [caskBuilderUtils.cpp::getCaskHandle::408] Error Code 2: Internal Error (Assertion !isGroupConv || (findConvShaderByHandle(klib, caskHandle) || cask_trt::availableLinkableConvShaders()->findByHandle(caskHandle)) failed. )
[TensorRT] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
I suspect that the builder picks a tactic from the provided cache that is not available on the current GPU device. Assuming this guess is right, it does not seem to be an unavoidable error, since the incompatible tactic could simply be skipped (I tried setting the ignoreMismatch flag, with no luck). So the feature request is essentially: make the builder skip the incompatible cache entry and proceed, instead of failing with a hard error.
P.S. I am aware that reusing a timing cache across devices is not recommended, but my benchmarks show that it is harmless.
Environment
TensorRT Version: 8.4.1
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version: Any
CUDNN Version: Any
Operating System: Any
@nvpohanh ^ ^
@dev0x13, have you tried setting the builder optimization level to 0 with the latest 8.6 release? Does the build speed fit your use scenario? Thanks!
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt-builder-optimization-level
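For reference, a minimal sketch of what that would look like (assuming TensorRT 8.6+, where IBuilderConfig::setBuilderOptimizationLevel is available; the builder and network are assumed to come from the usual ONNX parsing flow):

```cpp
#include <NvInfer.h>
#include <memory>

// Sketch: build a serialized engine with a reduced optimization level.
// Assumes TensorRT >= 8.6; `builder` and `network` are created elsewhere.
nvinfer1::IHostMemory* buildWithLowOptLevel(nvinfer1::IBuilder& builder,
                                            nvinfer1::INetworkDefinition& network) {
    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder.createBuilderConfig());
    // 0 = fastest build / least-optimized engine; higher levels trade build time for latency.
    config->setBuilderOptimizationLevel(0);
    return builder.buildSerializedNetwork(network, *config);
}
```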
@ttyio Yes, actually I tried this literally yesterday. The engine build time decreases tremendously, but the latency becomes terrible, so I decided to stick with optimization level 2 as a middle ground.
Generally speaking, I've developed a workaround for the incompatible timing cache issue: I pre-build multiple timing caches for the different compute capabilities I need and then dispatch at runtime based on the compute capability of the CUDA device currently in use. This works like a charm; however, being able to use a single timing cache for all architectures would still be great.
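Roughly, the dispatch part looks like this (a sketch; the file-naming scheme is just an illustration):

```cpp
#include <cuda_runtime_api.h>
#include <string>

// Sketch: pick a pre-built timing cache file based on the compute capability
// of the currently active CUDA device. The file-name pattern is illustrative.
std::string timingCachePathForCurrentDevice() {
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    // e.g. "timing_sm86.cache" on an Ampere GA10x GPU
    return "timing_sm" + std::to_string(prop.major) + std::to_string(prop.minor) + ".cache";
}
```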
Thanks!
Cool @dev0x13, glad to know you have a WAR in your flow!
The timing cache is an opaque object to the user, and we do not have an API to manipulate the cache file, e.g. to edit the kernel for layer A or remove the kernel for layer B. So it makes sense to manage the timing cache files on the user side, as you are doing in your workflow now. That way you have explicit control over what's needed and what's not. Thanks!
Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!