Add TensorRT infer support
This PR is intended to export the model from PyTorch to ONNX and then serialize the exported ONNX model to a native TRT engine, which is then used for inference with TensorRT, i.e.:
- [x] Implement `onnx_to_tensorrt.py` script
- [x] Export `onnx` model to TensorRT engine
- [x] Implement Python module to infer from serialized TRT engine
- [x] Integrate pre-process & post-process functions from `detect.py` script into the TensorRT infer script
- [x] Draw bounding box detections over a sample image
- [x] Custom Detect plugin integration for TensorRT < 8.0
- [x] Implement INT8 calibrator script for INT8 serialization
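For reference, a minimal sketch of the intended end-to-end workflow, pieced together from the commands that appear later in this thread; the simplification step and the exact script locations are assumptions, not something this PR mandates:

```bash
# 1. Export the PyTorch checkpoint to ONNX (with the --grid output head)
python models/export.py --weights yolov7.pt --grid

# 2. Optionally simplify the exported graph (some TensorRT versions need this)
python3 -m onnxsim yolov7.onnx yolov7-sim.onnx

# 3. Serialize the ONNX model to a native TensorRT engine
python deploy/TensorRT/onnx_to_tensorrt.py --onnx yolov7-sim.onnx --fp16 --explicit-batch -o yolov7.engine

# 4. Run inference from the serialized engine
python3 yolov7_trt.py video1.mp4
```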
When exported with `--grid`:

```
python models/export.py --weights yolov7.pt --grid
```

building the TensorRT engine fails:

```
root@3aa30b614471:/workspace/yolov7# python deploy/TensorRT/onnx_to_tensorrt.py --onnx yolov7.onnx --fp16 --explicit-batch -o yolov7.engine
Namespace(calibration_batch_size=128, calibration_cache='calibration.cache', calibration_data=None, debug=False, explicit_batch=True, explicit_precision=False, fp16=True, gpu_fallback=False, int8=False, max_batch_size=None, max_calibration_size=2048, onnx='yolov7.onnx', output='yolov7.engine', refittable=False, simple=False, strict_types=False, verbosity=None)
2022-07-10 07:53:52 - __main__ - INFO - TRT_LOGGER Verbosity: Severity.ERROR
2022-07-10 07:53:52 - __main__ - INFO - Setting BuilderFlag.FP16
[TensorRT] ERROR: [graphShapeAnalyzer.cpp::throwIfError::1306] Error Code 9: Internal Error (Mul_378: broadcast dimensions must be conformable)
ERROR: Failed to parse the ONNX file: yolov7.onnx
In node 378 (parseGraph): INVALID_NODE: Invalid Node - Mul_378
[graphShapeAnalyzer.cpp::throwIfError::1306] Error Code 9: Internal Error (Mul_378: broadcast dimensions must be conformable)
```
Any idea how to fix that, @xiang-wuu?
I have the same issue when using trtexec for conversion, so this is definitely a TensorRT / ONNX issue. Here: #66
@philipp-schmidt that could be an issue with the PyTorch and ONNX versions; try upgrading both to the latest versions. However, I am working on the post-processing part for the `--grid` option, which returns a primary output node with shape `(1, 25200, 85)`.
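For context, with `--grid` the box decoding and NMS still happen outside the engine. Below is a minimal NumPy/OpenCV sketch of what that post-processing could look like, assuming the usual YOLO layout of the 85 channels (cx, cy, w, h, objectness, 80 class scores); the function and variable names are illustrative, not the PR's actual code:

```python
import cv2
import numpy as np

def decode_grid_output(pred, conf_thres=0.25, iou_thres=0.45):
    """Decode a (1, 25200, 85) --grid output into boxes, scores and class ids."""
    pred = pred[0]                                    # (25200, 85)
    obj = pred[:, 4]
    cls_scores = pred[:, 5:]
    cls_ids = cls_scores.argmax(axis=1)
    scores = obj * cls_scores[np.arange(len(pred)), cls_ids]

    keep = scores > conf_thres                        # objectness * class confidence filter
    boxes, scores, cls_ids = pred[keep, :4], scores[keep], cls_ids[keep]

    # cx,cy,w,h -> x,y,w,h (top-left corner) for cv2.dnn.NMSBoxes;
    # coordinates are still in the 640x640 letterboxed input space.
    boxes[:, 0] -= boxes[:, 2] / 2
    boxes[:, 1] -= boxes[:, 3] / 2
    idx = np.asarray(cv2.dnn.NMSBoxes(boxes.tolist(), scores.tolist(),
                                      conf_thres, iou_thres), dtype=np.int64).reshape(-1)
    return boxes[idx], scores[idx], cls_ids[idx]
```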
Yes, it was the PyTorch version. I also had to run onnx-simplifier, otherwise TensorRT had issues with a few Resize operations.
Looking forward to trying your implementation.
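For anyone hitting the same Resize issues, a small sketch of the onnx-simplifier step (file names are just examples):

```python
import onnx
from onnxsim import simplify

# Fold constants and clean up the graph so TensorRT's ONNX parser accepts it.
model = onnx.load("yolov7.onnx")
model_simplified, ok = simplify(model)
assert ok, "simplified ONNX model could not be validated"
onnx.save(model_simplified, "yolov7-sim.onnx")
```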
Almost done, with some final typos to be resolved.
Quickly scanned the code and it looks really good!
A few questions / remarks:
- You use yolov7.cache for INT8, how do you put that together? Still a todo? Actually I'm curious about the YOLOv7 INT8 performance-to-accuracy tradeoff, so that would be cool to see!
- Conversion from ONNX to TensorRT can also be done with TensorRT directly without any additional code. The NGC TensorRT docker images come with a precompiled tool "trtexec" which will happily turn ONNX into an engine.
- I'm looking into making the batch size dynamic so that e.g. Triton Inference Server can combine smaller requests into larger batch sizes via a feature called Dynamic Batching (e.g. pack multiple simultaneously arriving batch-1 requests into one larger batch of 4). While coding this, did you somehow manage to make the input batch size of the TensorRT engine dynamic up to a maximum batch size? So basically the input shape would be either [-1,640,640,3] for explicit batching or [640,640,3] with implicit batching. In the past ONNX was unable to support implicit batching (this still seems to be the case) and custom plugins were a little hard to make work with dynamic (-1) + explicit batching. (A trtexec sketch with a dynamic-batch profile follows below.)
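Side note on the trtexec route mentioned above: a sketch of building an engine directly from ONNX, including an optimization profile for a dynamic batch dimension. The input tensor name `images`, the NCHW layout, and the batch limits are assumptions about the exported graph, and the ONNX must have been exported with a dynamic batch axis for the second command to work:

```bash
# Static FP16 engine straight from the ONNX file
trtexec --onnx=yolov7-sim.onnx --saveEngine=yolov7.engine --fp16

# Explicit-batch engine with a dynamic batch dimension up to 8
trtexec --onnx=yolov7-sim.onnx --saveEngine=yolov7-dyn.engine --fp16 \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:4x3x640x640 \
        --maxShapes=images:8x3x640x640
```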
Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
- PyTorch + CUDA: ~40fps, 78mAP
- TRT: ~24fps, 64mAP

Is this normal? Am I doing something wrong?
> Quickly scanned the code and it looks really good!
> A few questions / remarks:
> 1. You use yolov7.cache for INT8, how do you put that together? Still a todo? Actually I'm curious about the YOLOv7 INT8 performance-to-accuracy tradeoff, so that would be cool to see!
> 2. Conversion from ONNX to TensorRT can also be done with TensorRT directly without any additional code. The NGC TensorRT docker images come with a precompiled tool "trtexec" which will happily turn ONNX into an engine.
> 3. I'm looking into making the batch size dynamic so that e.g. Triton Inference Server can combine smaller requests into larger batch sizes via a feature called Dynamic Batching (e.g. pack multiple simultaneously arriving batch-1 requests into one larger batch of 4). While coding this, did you somehow manage to make the input batch size of the TensorRT engine dynamic up to a maximum batch size? So basically the input shape would be either [-1,640,640,3] for explicit batching or [640,640,3] with implicit batching. In the past ONNX was unable to support implicit batching (this still seems to be the case) and custom plugins were a little hard to make work with dynamic (-1) + explicit batching.

- Will add a calibration script for PTQ (a rough calibrator sketch follows below).
- Yes, serialization with `trtexec` is possible, but if using TRT < 8.0 the custom plugin needs to be preloaded.
- I haven't tested a max. dynamic batch size, but as far as I know dynamic batching is effectively abstracted by Triton, and exporting the ONNX model with implicit batching could make it work with Triton; still subject to trial & error!
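Regarding the yolov7.cache question: a minimal sketch of how a PTQ entropy calibrator could produce such a cache file, assuming the caller supplies preprocessed (N, 3, 640, 640) float32 batches; the class and argument names here are hypothetical, not this PR's actual calibrator script:

```python
import os
import numpy as np
import pycuda.autoinit  # noqa: F401 - creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed image batches to TensorRT and caches the INT8 scales."""

    def __init__(self, batches, batch_size, cache_file="yolov7.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)      # iterable of (batch_size, 3, 640, 640) float32 arrays
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batch_size * 3 * 640 * 640 * 4)  # float32 = 4 bytes

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                   # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator would then be attached to the builder config (`config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = calibrator`) before building the engine.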
> Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
> - PyTorch + CUDA: ~40fps, 78mAP
> - TRT: ~24fps, 64mAP
>
> Is this normal? Am I doing something wrong?

Optimization is out of scope for this PR; it is intended to provide a minimalistic, deployable TRT implementation, and optimization is left to further contributions.
@albertfaromatics How do you test FPS and mAP? There is very little chance that your TensorRT engine is slower than pytorch directly. Especially on Jetson.
@philipp-schmidt For PyTorch + CUDA, I simply adapted the detect.py here to read a folder of images (around 200 of them), computed the prediction time (inference + NMS), and computed FPS. For TensorRT, I followed the README on the repo, with export, simplify, onnx_to_tensorrt (I'm using TensorRT 8.4), and run.
These steps gave me the FPS numbers (40-ish vs 25-ish). For mAP I used the test here and adapted code to get the detections from TensorRT and "manually" compute mAP.
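One thing worth ruling out is the measurement itself: without warm-up iterations and GPU synchronization, PyTorch and TensorRT timings are not comparable. A rough sketch of a timing loop, where `run_inference` and `images` are hypothetical stand-ins for the actual pipeline:

```python
import time
import torch

def measure_fps(run_inference, images, warmup=10):
    """Time inference + NMS over preprocessed images, excluding file I/O."""
    with torch.no_grad():
        for img in images[:warmup]:       # warm-up iterations are not timed
            run_inference(img)
        torch.cuda.synchronize()          # make sure all queued GPU work is done
        start = time.perf_counter()
        for img in images:
            run_inference(img)
        torch.cuda.synchronize()          # wait for the GPU before stopping the clock
        elapsed = time.perf_counter() - start
    return len(images) / elapsed
```

For the TensorRT/pycuda path the equivalent is synchronizing the CUDA stream (e.g. `stream.synchronize()`) before reading the clock.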
Try to run your engine with trtexec instead; it will give you a very good indication of actual compute latency (see the example below).
Last few steps of this: https://github.com/isarsoft/yolov4-triton-tensorrt#build-tensorrt-engine
I don't think it comes prebuilt in the Linux 4 Tegra TensorRT docker images for Jetson though.
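For example, a quick latency check on an already-built engine could look like this (flag values are only illustrative):

```bash
# Reports GPU compute latency and throughput for the serialized engine
trtexec --loadEngine=yolov7.engine --warmUp=500 --iterations=200 --avgRuns=10
```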
@philipp-schmidt I'll give it a try. I can compile it myself from the tensorrt/samples folder, but I've never used it before.
I'll try it and see why I get these results. Thanks!
@WongKinYiu good to merge.
it works, but no bounding box is drawn
> it works, but no bounding box is drawn

Can you share the environment details?
torch 1.11.0+cu113, onnx 1.12.0, tensorrt 8.4.1.5
I use the ScatterND built-in plugin to run the code, but no bounding box is drawn.
Considering the built-in plugin is used, is this a problem with data preprocessing?
I use the deploy_onnx_trt branch to generate yolov7.onnx; to get yolov7.engine, I run the following command: `python3 onnx_to_tensorrt.py --explicit-batch --onnx yolov7-sim.onnx -o yolov7.engine`
@dongdengwei, try without building the plugin if using TRT > 8.0.
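For clarity: on TRT >= 8.0 the built-in plugins (including ScatterND) only need to be registered before deserializing the engine, whereas on older versions a separately built plugin library has to be preloaded. A small sketch, where the .so file name is hypothetical:

```python
import ctypes
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)

# TRT >= 8.0: register the built-in plugin creators (ScatterND etc.)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

# TRT < 8.0: a custom Detect plugin library would have to be loaded first, e.g.:
# ctypes.CDLL("libcustom_detect_plugin.so")
```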
I run the following command to do the inference: `python3 yolov7_trt.py video1.mp4`, but still no bounding box is drawn.
It seems that I should replace `return x if self.training else (torch.cat(z, 1), x)` with `return x if self.training else (torch.cat(z, 1), ) if not self.export else (torch.cat(z, 1), x)` in yolo.py. But in the environment torch 1.10.1+cu111, onnx 1.8.1, tensorrt 7.2.3.4, it gives the following error:

```
2022-07-15 18:18:59 - main - INFO - TRT_LOGGER Verbosity: Severity.ERROR
getFieldNames
createPlugin
[TensorRT] ERROR: Mul_378: elementwise inputs must have same dimensions or follow broadcast rules (input dimensions were [1,3,80,80,2] and [1,1,1,3,2]).
```

Should I upgrade torch 1.10.1 to 1.11.0 and onnx 1.8.1 to 1.12.0?
@dongdengwei PyTorch > 1.11.0 is required to make it work; 1.12.0 is recommended.
@xiang-wuu @philipp-schmidt @AlexeyAB @Linaom1214 can you share the mAP performance of the converted model? Is the accuracy the same after conversion, or how much does it drop? Also, it would be great if you added support for checking the mAP of the .trt model and for inference on video. Thanks
https://github.com/Linaom1214/tensorrt-python/issues/26 not able to do inference on videos
> Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
> - PyTorch + CUDA: ~40fps, 78mAP
> - TRT: ~24fps, 64mAP
>
> Is this normal? Am I doing something wrong?

Hi, I have tested the yolov7-tiny TensorRT model on a Jetson Xavier NX with my own code, and the result is shown in issue #703: https://github.com/WongKinYiu/yolov7/issues/703, maybe you can check it.
> Linaom1214/TensorRT-For-YOLO-Series#26 not able to do inference on videos

The reason is that the Colab env doesn't support the OpenCV imshow function.
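A common workaround in headless environments like Colab is to write the annotated frames to a file instead of calling `cv2.imshow`; a minimal sketch (the inference/drawing step is left as a placeholder):

```python
import cv2

cap = cv2.VideoCapture("video1.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("result.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # ... run TRT inference and draw the boxes on `frame` here ...
    out.write(frame)

cap.release()
out.release()
```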
> Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
> - PyTorch + CUDA: ~40fps, 78mAP
> - TRT: ~24fps, 64mAP
>
> Is this normal? Am I doing something wrong?

Hi @xiang-wuu, I'm using an NVIDIA Jetson Xavier AGX with JetPack 4.6.1 and CUDA 10.2. I would like to recreate these results for both .pt and TRT formats. We have tried to convert to .engine files using the trtexec already present with the L4T installation on the Jetson device, but the inference timings are not good. For inference we used the official yolov7 DeepStream inference script from NVIDIA.
Environment setup:
Should the requirements.txt from the yolov7 repo be used on the Xavier AGX as-is? Should we install PyTorch from 'PyTorch for Jetson'? The PyTorch wheel corresponding to JetPack 4.6.1 is this.
Inference on Jetson device:
Is the original detect.py sufficient for inference using .pt weights on Jetson devices? Is the YOLOv7ONNXandTRT.ipynb file sufficient for inference using TRT format weights on Jetson devices?
Looking forward to your response.
Cheers :)