Add TensorRT infer support
This PR is intended to export the model from PyTorch to ONNX and then serialize the exported ONNX model to a native TRT engine, which is then used for inference with TensorRT, i.e.:
- [x] Implement `onnx_to_tensorrt.py` script
- [x] Export `onnx` model to TensorRT engine
- [x] Implement Python module to infer from serialized TRT engine
- [x] Integrate pre-process & post-process functions from `detect.py` script into the TensorRT infer script
- [x] Draw bounding box detections over a sample image
- [x] Custom Detect plugin integration for TensorRT < 8.0
- [x] Implement INT8 calibrator script for INT8 serialization
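For reference, a minimal sketch of the intended end-to-end workflow, pieced together from the commands that appear later in this thread; the simplification step and the exact script locations are assumptions, not something this PR mandates:

```bash
# 1. Export the PyTorch checkpoint to ONNX (with the --grid output head)
python models/export.py --weights yolov7.pt --grid

# 2. Optionally simplify the exported graph (some TensorRT versions need this)
python3 -m onnxsim yolov7.onnx yolov7-sim.onnx

# 3. Serialize the ONNX model to a native TensorRT engine
python deploy/TensorRT/onnx_to_tensorrt.py --onnx yolov7-sim.onnx --fp16 --explicit-batch -o yolov7.engine

# 4. Run inference from the serialized engine
python3 yolov7_trt.py video1.mp4
```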
When exported with `--grid`:

```
python models/export.py --weights yolov7.pt --grid
```

building the TensorRT engine fails:

```
root@3aa30b614471:/workspace/yolov7# python deploy/TensorRT/onnx_to_tensorrt.py --onnx yolov7.onnx --fp16 --explicit-batch -o yolov7.engine
Namespace(calibration_batch_size=128, calibration_cache='calibration.cache', calibration_data=None, debug=False, explicit_batch=True, explicit_precision=False, fp16=True, gpu_fallback=False, int8=False, max_batch_size=None, max_calibration_size=2048, onnx='yolov7.onnx', output='yolov7.engine', refittable=False, simple=False, strict_types=False, verbosity=None)
2022-07-10 07:53:52 - __main__ - INFO - TRT_LOGGER Verbosity: Severity.ERROR
2022-07-10 07:53:52 - __main__ - INFO - Setting BuilderFlag.FP16
[TensorRT] ERROR: [graphShapeAnalyzer.cpp::throwIfError::1306] Error Code 9: Internal Error (Mul_378: broadcast dimensions must be conformable)
ERROR: Failed to parse the ONNX file: yolov7.onnx
In node 378 (parseGraph): INVALID_NODE: Invalid Node - Mul_378
[graphShapeAnalyzer.cpp::throwIfError::1306] Error Code 9: Internal Error (Mul_378: broadcast dimensions must be conformable)
```
Any idea how to fix that, @xiang-wuu?
I have the same issue when using trtexec for conversion, so this is definitely a TensorRT / ONNX issue. Here: #66
@philipp-schmidt that could be an issue with the PyTorch and ONNX versions; try upgrading both to the latest versions. However, I am working on the post-processing part for the `--grid` option, which returns a primary output node with shape `(1, 25200, 85)`.
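For context, with `--grid` the box decoding and NMS still happen outside the engine. Below is a minimal NumPy/OpenCV sketch of what that post-processing could look like, assuming the usual YOLO layout of the 85 channels (cx, cy, w, h, objectness, 80 class scores); the function and variable names are illustrative, not the PR's actual code:

```python
import cv2
import numpy as np

def decode_grid_output(pred, conf_thres=0.25, iou_thres=0.45):
    """Decode a (1, 25200, 85) --grid output into boxes, scores and class ids."""
    pred = pred[0]                                    # (25200, 85)
    obj = pred[:, 4]
    cls_scores = pred[:, 5:]
    cls_ids = cls_scores.argmax(axis=1)
    scores = obj * cls_scores[np.arange(len(pred)), cls_ids]

    keep = scores > conf_thres                        # objectness * class confidence filter
    boxes, scores, cls_ids = pred[keep, :4], scores[keep], cls_ids[keep]

    # cx,cy,w,h -> x,y,w,h (top-left corner) for cv2.dnn.NMSBoxes;
    # coordinates are still in the 640x640 letterboxed input space.
    boxes[:, 0] -= boxes[:, 2] / 2
    boxes[:, 1] -= boxes[:, 3] / 2
    idx = np.asarray(cv2.dnn.NMSBoxes(boxes.tolist(), scores.tolist(),
                                      conf_thres, iou_thres), dtype=np.int64).reshape(-1)
    return boxes[idx], scores[idx], cls_ids[idx]
```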
Yes, it was the PyTorch version. I also had to run onnx-simplifier, otherwise TensorRT had issues with a few Resize operations.
Looking forward to trying your implementation.
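For anyone hitting the same Resize issues, a small sketch of the onnx-simplifier step (file names are just examples):

```python
import onnx
from onnxsim import simplify

# Fold constants and clean up the graph so TensorRT's ONNX parser accepts it.
model = onnx.load("yolov7.onnx")
model_simplified, ok = simplify(model)
assert ok, "simplified ONNX model could not be validated"
onnx.save(model_simplified, "yolov7-sim.onnx")
```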
Almost done, with some final typos to be resolved.
Quickly scanned the code and it looks really good!
A few questions / remarks:
- You use yolov7.cache for INT8, how do you put that together? Still a todo? Actually I'm curious about the YOLOv7 INT8 performance-to-accuracy tradeoff, so that would be cool to see!
- Conversion from ONNX to TensorRT can also be done with TensorRT directly without any additional code. The NGC TensorRT docker images come with a precompiled tool "trtexec" which will happily turn ONNX into an engine.
- I'm looking into making the batch size dynamic so that e.g. Triton Inference Server can combine smaller requests into larger batch sizes via a feature called Dynamic Batching (e.g. pack multiple simultaneously arriving batch-1 requests into one larger batch of 4). While coding this, did you somehow manage to make the input batch size of the TensorRT engine dynamic up to a maximum batch size? So basically the input shape would be either [-1,640,640,3] for explicit batching or [640,640,3] with implicit batching. In the past ONNX was unable to support implicit batching (this still seems to be the case) and custom plugins were a little hard to make work with dynamic (-1) + explicit batching. (A trtexec sketch with a dynamic-batch profile follows below.)
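Side note on the trtexec route mentioned above: a sketch of building an engine directly from ONNX, including an optimization profile for a dynamic batch dimension. The input tensor name `images`, the NCHW layout, and the batch limits are assumptions about the exported graph, and the ONNX must have been exported with a dynamic batch axis for the second command to work:

```bash
# Static FP16 engine straight from the ONNX file
trtexec --onnx=yolov7-sim.onnx --saveEngine=yolov7.engine --fp16

# Explicit-batch engine with a dynamic batch dimension up to 8
trtexec --onnx=yolov7-sim.onnx --saveEngine=yolov7-dyn.engine --fp16 \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:4x3x640x640 \
        --maxShapes=images:8x3x640x640
```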
Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
- PyTorch + CUDA: ~40fps, 78mAP
- TRT: ~24fps, 64mAP

Is this normal? Am I doing something wrong?
> Quickly scanned the code and it looks really good!
> A few questions / remarks:
> 1. You use yolov7.cache for INT8, how do you put that together? Still a todo? Actually I'm curious about the YOLOv7 INT8 performance-to-accuracy tradeoff, so that would be cool to see!
> 2. Conversion from ONNX to TensorRT can also be done with TensorRT directly without any additional code. The NGC TensorRT docker images come with a precompiled tool "trtexec" which will happily turn ONNX into an engine.
> 3. I'm looking into making the batch size dynamic so that e.g. Triton Inference Server can combine smaller requests into larger batch sizes via a feature called Dynamic Batching (e.g. pack multiple simultaneously arriving batch-1 requests into one larger batch of 4). While coding this, did you somehow manage to make the input batch size of the TensorRT engine dynamic up to a maximum batch size? So basically the input shape would be either [-1,640,640,3] for explicit batching or [640,640,3] with implicit batching. In the past ONNX was unable to support implicit batching (this still seems to be the case) and custom plugins were a little hard to make work with dynamic (-1) + explicit batching.

- Will add a calibration script for PTQ (a rough calibrator sketch follows below).
- Yes, serialization with `trtexec` is possible, but if using TRT < 8.0 the custom plugin needs to be preloaded.
- I haven't tested a max. dynamic batch size, but as far as I know dynamic batching is effectively abstracted by Triton, and exporting the ONNX model with implicit batching could make it work with Triton; still subject to trial & error!
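Regarding the yolov7.cache question: a minimal sketch of how a PTQ entropy calibrator could produce such a cache file, assuming the caller supplies preprocessed (N, 3, 640, 640) float32 batches; the class and argument names here are hypothetical, not this PR's actual calibrator script:

```python
import os
import numpy as np
import pycuda.autoinit  # noqa: F401 - creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed image batches to TensorRT and caches the INT8 scales."""

    def __init__(self, batches, batch_size, cache_file="yolov7.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)      # iterable of (batch_size, 3, 640, 640) float32 arrays
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batch_size * 3 * 640 * 640 * 4)  # float32 = 4 bytes

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                   # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator would then be attached to the builder config (`config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = calibrator`) before building the engine.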
> Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
> - PyTorch + CUDA: ~40fps, 78mAP
> - TRT: ~24fps, 64mAP
>
> Is this normal? Am I doing something wrong?

Optimization is out of scope for this PR; it is intended to provide a minimalistic, deployable TRT implementation, and optimization is left to further contributions.
@albertfaromatics How do you test FPS and mAP? There is very little chance that your TensorRT engine is slower than pytorch directly. Especially on Jetson.
@philipp-schmidt For PyTorch + CUDA, I simply adapted the detect.py here to read a folder of images (around 200 of them), computed the prediction time (inference + NMS), and computed FPS. For TensorRT, I followed the README on the repo, with export, simplify, onnx_to_tensorrt (I'm using TensorRT 8.4), and run.
These steps gave me the FPS numbers (40-ish vs 25-ish). For mAP I used the test here and adapted code to get the detections from TensorRT and "manually" compute mAP.
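One thing worth ruling out is the measurement itself: without warm-up iterations and GPU synchronization, PyTorch and TensorRT timings are not comparable. A rough sketch of a timing loop, where `run_inference` and `images` are hypothetical stand-ins for the actual pipeline:

```python
import time
import torch

def measure_fps(run_inference, images, warmup=10):
    """Time inference + NMS over preprocessed images, excluding file I/O."""
    with torch.no_grad():
        for img in images[:warmup]:       # warm-up iterations are not timed
            run_inference(img)
        torch.cuda.synchronize()          # make sure all queued GPU work is done
        start = time.perf_counter()
        for img in images:
            run_inference(img)
        torch.cuda.synchronize()          # wait for the GPU before stopping the clock
        elapsed = time.perf_counter() - start
    return len(images) / elapsed
```

For the TensorRT/pycuda path the equivalent is synchronizing the CUDA stream (e.g. `stream.synchronize()`) before reading the clock.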
Try to run your engine with trtexec instead; it will give you a very good indication of actual compute latency (see the example below).
Last few steps of this: https://github.com/isarsoft/yolov4-triton-tensorrt#build-tensorrt-engine
I don't think it comes prebuilt in the Linux 4 Tegra TensorRT docker images for Jetson though.
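For example, a quick latency check on an already-built engine could look like this (flag values are only illustrative):

```bash
# Reports GPU compute latency and throughput for the serialized engine
trtexec --loadEngine=yolov7.engine --warmUp=500 --iterations=200 --avgRuns=10
```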
@philipp-schmidt I'll give it a try. I can compile it myself from the tensorrt/samples folder, but I've never used it before.
I'll try it and see why I get these results. Thanks!
@WongKinYiu good to merge.
it works, but no bounding box is drawn
> it works, but no bounding box is drawn

Can you share the environment details?
torch 1.11.0+cu113, onnx 1.12.0, tensorrt 8.4.1.5
I use the ScatterND built-in plugin to run the code, but no bounding box is drawn.
Considering the built-in plugin is used, is this a problem with data preprocessing?
I use the deploy_onnx_trt branch to generate yolov7.onnx; to get yolov7.engine, I run the following command: `python3 onnx_to_tensorrt.py --explicit-batch --onnx yolov7-sim.onnx -o yolov7.engine`
@dongdengwei, try without building the plugin if using TRT > 8.0.
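For clarity: on TRT >= 8.0 the built-in plugins (including ScatterND) only need to be registered before deserializing the engine, whereas on older versions a separately built plugin library has to be preloaded. A small sketch, where the .so file name is hypothetical:

```python
import ctypes
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)

# TRT >= 8.0: register the built-in plugin creators (ScatterND etc.)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

# TRT < 8.0: a custom Detect plugin library would have to be loaded first, e.g.:
# ctypes.CDLL("libcustom_detect_plugin.so")
```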
I run the following command to do the inference: `python3 yolov7_trt.py video1.mp4`, but still no bounding box is drawn.
It seems that I should replace `return x if self.training else (torch.cat(z, 1), x)` with `return x if self.training else (torch.cat(z, 1), ) if not self.export else (torch.cat(z, 1), x)` in yolo.py. But in the environment torch 1.10.1+cu111, onnx 1.8.1, tensorrt 7.2.3.4, it gives the following error:

```
2022-07-15 18:18:59 - main - INFO - TRT_LOGGER Verbosity: Severity.ERROR
getFieldNames
createPlugin
[TensorRT] ERROR: Mul_378: elementwise inputs must have same dimensions or follow broadcast rules (input dimensions were [1,3,80,80,2] and [1,1,1,3,2]).
```

Should I upgrade torch 1.10.1 to 1.11.0 and onnx 1.8.1 to 1.12.0?
@dongdengwei PyTorch > 1.11.0 is required to make it work; 1.12.0 is recommended.
@xiang-wuu @philipp-schmidt @AlexeyAB @Linaom1214 can you share the mAP performance of the converted model? Is the accuracy the same after conversion, or how much does it drop? Also, it would be great if you added support for checking the mAP of the .trt model and for inference on video. Thanks
https://github.com/Linaom1214/tensorrt-python/issues/26 not able to do inference on videos
> Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
> - PyTorch + CUDA: ~40fps, 78mAP
> - TRT: ~24fps, 64mAP
>
> Is this normal? Am I doing something wrong?

Hi, I have tested the yolov7-tiny TensorRT model on a Jetson Xavier NX with my own code, and the result is shown in issue #703: https://github.com/WongKinYiu/yolov7/issues/703, maybe you can check it.
> Linaom1214/TensorRT-For-YOLO-Series#26 not able to do inference on videos

The reason is that the Colab env doesn't support the OpenCV imshow function.
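A common workaround in headless environments like Colab is to write the annotated frames to a file instead of calling `cv2.imshow`; a minimal sketch (the inference/drawing step is left as a placeholder):

```python
import cv2

cap = cv2.VideoCapture("video1.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("result.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # ... run TRT inference and draw the boxes on `frame` here ...
    out.write(frame)

cap.release()
out.release()
```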
> Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an NVIDIA Jetson Xavier NX. I converted the model from PyTorch with no problem after some tries, but when testing the results, both mAP and FPS are much lower:
> - PyTorch + CUDA: ~40fps, 78mAP
> - TRT: ~24fps, 64mAP
>
> Is this normal? Am I doing something wrong?

Hi @xiang-wuu, I'm using an NVIDIA Jetson Xavier AGX with JetPack 4.6.1 and CUDA 10.2. I would like to recreate these results for both .pt and TRT formats. We have tried to convert to .engine files using the trtexec already present with the L4T installation on the Jetson device, but the inference timings are not good. For inference we used the official yolov7 DeepStream inference script from NVIDIA.
Environment setup:
Should the requirements.txt from the yolov7 repo be used on the Xavier AGX as-is? Should we install PyTorch from 'PyTorch for Jetson'? The PyTorch wheel corresponding to JetPack 4.6.1 is this.
Inference on Jetson device:
Is the original detect.py sufficient for inference using .pt weights on Jetson devices? Is the YOLOv7ONNXandTRT.ipynb file sufficient for inference using TRT format weights on Jetson devices?
Looking forward to your response.
Cheers :)