❓ [Question] Speed problem about TRTorch and Torch-TensorRT - Device Compatibility Check
Question 1
I find that a TorchScript model optimized by TRTorch 0.2.0 is faster than the equivalent TensorRT model (all models built through the Python API) for common architectures such as the ResNet and RepVGG series. Shouldn't the TensorRT model be the fastest? I want to know why this happens.
TorchScript model (optimized by TRTorch 0.2.0):
- torch 1.7.1+cu110
- trtorch 0.2.0
- TensorRT 7.2
- cuDNN 8.2
- GPU: Tesla T4
- CentOS Linux release 7.6.1810 (Core)
TensorRT model (.trt):
- torch 1.7.1+cu110
- tensorrt 8.2.0.6
- cuDNN 8.2
- GPU: Tesla T4
- CentOS Linux release 7.6.1810 (Core)
Question 2
I found that the inference speed of a TorchScript model differs depending on which version of Torch-TensorRT (TRTorch) was used to optimize the same architecture. For the same ResNet model, the TorchScript model optimized by TRTorch 0.2.0 (torch 1.7.1+cu110, TensorRT 7.2, cuDNN 8.2) is faster than the one optimized by Torch-TensorRT 1.0.0 (torch 1.10.1+cu113, TensorRT 8.0, cuDNN 8.2). Shouldn't the newer Torch-TensorRT 1.0.0 be faster? I'm also very confused.
- GPU: Tesla T4
- CentOS Linux release 7.6.1810 (Core)
- input shape: (1,3,224,224)
Here are some of my test results.
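For reference, a minimal sketch of the kind of CUDA-event timing loop typically used for such latency measurements; `run_inference` below is a hypothetical stand-in for the model's forward pass, not part of TRTorch or TensorRT:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder: substitute the TorchScript or TensorRT forward pass here.
void run_inference() {}

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm up so one-time initialization is excluded from the measurement.
  for (int i = 0; i < 10; ++i) run_inference();
  cudaDeviceSynchronize();

  const int iters = 100;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) run_inference();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("average latency: %.3f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
```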
Between these two versions, a constant-time operation was added that checks the compatibility of the current device with the compiled model. This is likely the overhead you are experiencing.
We are investigating whether it can be mitigated for subsequent inferences once the model is loaded.
This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
This device check cannot currently be mitigated safely. We are investigating options in TRT to reduce this overhead.
Explore using cudaPointerAttributes (https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaPointerAttributes.html#structcudaPointerAttributes) to query where the input data resides, and assume that device is the correct one?
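A minimal sketch of that idea, assuming the engine's target device can be inferred from the input pointer itself via cudaPointerGetAttributes (`device_of` is a hypothetical helper):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: returns the ordinal of the device that owns the
// allocation behind `ptr`, or -1 for host / unregistered memory.
int device_of(const void* ptr) {
  cudaPointerAttributes attr{};
  if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) return -1;
  return (attr.type == cudaMemoryTypeDevice || attr.type == cudaMemoryTypeManaged)
             ? attr.device
             : -1;
}

int main() {
  void* buf = nullptr;
  cudaMalloc(&buf, 1024);
  printf("input resides on device %d\n", device_of(buf));
  cudaFree(buf);
  return 0;
}
```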
https://developer.nvidia.com/blog/cuda-pro-tip-the-fast-way-to-query-device-properties/
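The gist of that article: cudaGetDeviceProperties fills in every field of the property struct, some of which require slow queries, whereas cudaDeviceGetAttribute fetches a single attribute cheaply. A minimal sketch contrasting the two ways of reading the compute capability of device 0:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Slow path: populates the entire property struct.
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);
  printf("full struct: sm_%d%d\n", prop.major, prop.minor);

  // Fast path: query only the attributes actually needed.
  int major = 0, minor = 0;
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, 0);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, 0);
  printf("attributes:  sm_%d%d\n", major, minor);
  return 0;
}
```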
This check, added here, likely caused the perf issue: https://github.com/pytorch/TensorRT/blob/bf4474dc7816c184489d3985ce892315f5e0cc42/core/runtime/runtime.cpp#L81
The check invokes the constructor of the TensorRT device wrapper, RTDevice::RTDevice: https://github.com/pytorch/TensorRT/blob/bf4474dc7816c184489d3985ce892315f5e0cc42/core/runtime/RTDevice.cpp#L16
That constructor in turn calls cudaGetDeviceProperties, which is expensive, but the article above may be used to mitigate the issue.
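One possible mitigation, sketched below purely as an assumption (not the implemented fix): query the device properties once per device when the model is loaded and reuse the cached copy on every subsequent inference, keeping cudaGetDeviceProperties off the hot path. `cached_properties` is a hypothetical helper.

```cpp
#include <cuda_runtime.h>
#include <mutex>
#include <unordered_map>

// Hypothetical cache: pay the cudaGetDeviceProperties cost once per
// device, then serve repeated lookups from memory.
const cudaDeviceProp& cached_properties(int device) {
  static std::unordered_map<int, cudaDeviceProp> cache;
  static std::mutex m;
  std::lock_guard<std::mutex> lock(m);
  auto it = cache.find(device);
  if (it == cache.end()) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);  // expensive, but only once
    it = cache.emplace(device, prop).first;
  }
  return it->second;
}

int main() {
  int dev = 0;
  cudaGetDevice(&dev);
  // Subsequent calls hit the cache instead of the driver.
  for (int i = 0; i < 1000; ++i) (void)cached_properties(dev);
  return 0;
}
```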
This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.