# TensorRT_EX

## Environments
- Windows 10 laptop
- CPU i7-11375H
- GPU RTX-3060
- Visual Studio 2017
- CUDA 11.1
- TensorRT 8.0.3.4 (unet)
- TensorRT 8.2.0.6 (detr, yolov5s, real-esrgan)
- OpenCV 3.4.5
- Create an `Engine` directory for the engine files
- Create an `Int8_calib_table` directory for the PTQ calibration tables
## Custom plugin example
- Layer for input preprocessing (NHWC -> NCHW, BGR -> RGB, [0, 255] -> [0, 1] normalization)
- plugin_ex1.cpp (plugin sample code)
- preprocess.hpp (plugin definition)
- preprocess.cu (preprocessing CUDA kernel)
- Validation_py/Validation_preproc.py (result validation against PyTorch)
## Classification model

### vgg11 model
- vgg11.cpp
- With preprocessing plugin

### resnet18 model
- resnet18.cpp
- 100 images from the COCO val2017 dataset for PTQ calibration
- All results match PyTorch
- Comparison of average execution time over 100 iterations and GPU memory usage for one 224x224x3 image
|                   | PyTorch | TensorRT | TensorRT | TensorRT |
|-------------------|---------|----------|----------|----------|
| Precision         | FP32    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration [ms] | 4.1     | 1.7      | 0.7      | 0.6      |
| FPS [frame/sec]   | 243     | 590      | 1385     | 1577     |
| Memory [GB]       | 1.551   | 1.288    | 0.941    | 0.917    |
## Semantic Segmentation model
- UNet model (unet.cpp)
- Use TensorRT 8.0.3.4 for the UNet model (version 8.2.0.6 raises an error for this model)
- unet_carvana_scale0.5_epoch1.pth
- Additional preprocessing (resize & letterbox padding) with OpenCV
- Postprocessing (model output to image)
- All results match PyTorch
- Comparison of average execution time over 100 iterations and GPU memory usage for one 512x512x3 image
|                   | PyTorch | PyTorch | TensorRT | TensorRT | TensorRT |
|-------------------|---------|---------|----------|----------|----------|
| Precision         | FP32    | FP16    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration [ms] | 66.21   | 34.58   | 40.81    | 13.52    | 8.19     |
| FPS [frame/sec]   | 15      | 29      | 25       | 77       | 125      |
| Memory [GB]       | 3.863   | 2.677   | 1.552    | 1.367    | 1.051    |
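The letterbox step mentioned above (resize while keeping aspect ratio, then pad to the target size) can be sketched as a small geometry helper. The repo does the actual resize/border fill with OpenCV; this illustrative version only computes the scaled size and padding.

```cpp
#include <algorithm>

// Letterbox geometry: scale the source to fit inside dst_w x dst_h while
// preserving aspect ratio, then pad the remainder symmetrically.
// Illustrative helper; the repo performs the resize/padding with OpenCV.
struct Letterbox { int new_w, new_h, pad_x, pad_y; };

Letterbox letterbox(int src_w, int src_h, int dst_w, int dst_h) {
    float r = std::min(dst_w / (float)src_w, dst_h / (float)src_h);
    Letterbox lb;
    lb.new_w = (int)(src_w * r);            // resized width
    lb.new_h = (int)(src_h * r);            // resized height
    lb.pad_x = (dst_w - lb.new_w) / 2;      // left/right border
    lb.pad_y = (dst_h - lb.new_h) / 2;      // top/bottom border
    return lb;
}
```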
## Object Detection model (ViT)
- DETR model (detr_trt.cpp)
- Additional preprocessing (mean/std normalization)
- Postprocessing (draw detection results on the image)
- All results match PyTorch
- Comparison of average execution time over 100 iterations and GPU memory usage for one 500x500x3 image
|                   | PyTorch | PyTorch | TensorRT | TensorRT | TensorRT |
|-------------------|---------|---------|----------|----------|----------|
| Precision         | FP32    | FP16    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration [ms] | 37.03   | 30.71   | 16.40    | 6.07     | 5.30     |
| FPS [frame/sec]   | 27      | 33      | 61       | 165      | 189      |
| Memory [GB]       | 1.563   | 1.511   | 1.212    | 1.091    | 1.005    |
## Object Detection model
- Yolov5s model (yolov5s.cpp)
- Comparison of average execution time over 100 iterations and GPU memory usage for one 640x640x3 resized & padded image
|                   | PyTorch | TensorRT | TensorRT |
|-------------------|---------|----------|----------|
| Precision         | FP32    | FP32     | Int8 (PTQ) |
| Avg duration [ms] | 7.72    | 6.16     | 2.86     |
| FPS [frame/sec]   | 129     | 162      | 350      |
| Memory [GB]       | 1.670   | 1.359    | 0.920    |
## Super-Resolution model
- Real-ESRGAN model (real-esrgan.cpp)
- RealESRGAN_x4plus.pth
- 4x upscaling (448x640x3 -> 1792x2560x3)
- Comparison of average execution time over 100 iterations and GPU memory usage
- [update] RealESRGAN_x2plus model (set OUT_SCALE=2)
|                   | PyTorch | PyTorch | TensorRT | TensorRT |
|-------------------|---------|---------|----------|----------|
| Precision         | FP32    | FP16    | FP32     | FP16     |
| Avg duration [ms] | 4109    | 1936    | 2139     | 737      |
| FPS [frame/sec]   | 0.24    | 0.52    | 0.47     | 1.35     |
| Memory [GB]       | 5.029   | 4.407   | 3.807    | 3.311    |
## Object Detection model 2
- Yolov6s model (yolov6.cpp)
- Comparison of average execution time over 1000 iterations and GPU memory usage (with preprocessing, without NMS, 536x640x3)
|                   | PyTorch | TensorRT | TensorRT | TensorRT |
|-------------------|---------|----------|----------|----------|
| Precision         | FP32    | FP32     | FP16     | Int8 (PTQ) |
| Avg duration [ms] | 20.7    | 10.3     | 3.54     | 2.58     |
| FPS [frame/sec]   | 48.14   | 96.21    | 282.26   | 387.89   |
| Memory [GB]       | 1.582   | 1.323    | 0.956    | 0.913    |
## Object Detection model 3 (in progress)
- Yolov7 model (yolov7.cpp)

## Using the C++ TensorRT model from Python via a DLL
## A typical TensorRT model creation sequence using the TensorRT API
1. Prepare the trained model in the training framework (generate the weight file to be used in TensorRT).
2. Implement the model with the TensorRT API, matching the trained model's structure.
3. Extract the weights from the trained model.
4. Pass the weights appropriately to each layer of the prepared TensorRT model.
5. Build and run.
6. After the TensorRT model is built, the model stream is serialized and saved as an engine file.
7. Subsequent runs load only the engine file for inference (if model parameters or layers are modified, re-run from step 4).
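Steps 3 and 4 above hinge on getting the extracted weights to the right layer by name. A minimal sketch, assuming a simple one-record-per-line text dump ("name count v0 v1 ..."); the repo's actual weight file format may differ.

```cpp
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Load extracted weights into a name -> values map so each TensorRT layer
// can be fed its tensor by name. The text format assumed here is one record
// per line: "<layer name> <element count> <values...>" (illustrative only).
std::map<std::string, std::vector<float>> load_weights(std::istream& in) {
    std::map<std::string, std::vector<float>> weights;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::string name;
        size_t count = 0;
        if (!(ls >> name >> count)) continue;  // skip blank/bad lines
        std::vector<float> vals(count);
        for (size_t i = 0; i < count; ++i) ls >> vals[i];
        weights[name] = std::move(vals);
    }
    return weights;
}
```

Each entry's raw pointer and count would then be handed to the corresponding layer when building the network with the TensorRT API.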
## Reference