
yolov7 inference slower than yolov5

Open NicholasZollo opened this issue 1 year ago • 16 comments

I tried to compare the inference speed of yolov7 and yolov5m trained on a custom dataset, running on a Tesla T4 16 GB GPU. The paper claims that yolov7 should be significantly faster here, but in my testing the inference time of yolov7 was twice that of yolov5m. The inference time I'm getting seems to be roughly proportional to the FLOPs of the model. To run the test I used the --task speed flag with test.py in yolov7 and val.py in yolov5. I made sure that they were running on the GPU, not the CPU, but this was still the case.

NicholasZollo avatar Jul 19 '22 19:07 NicholasZollo

I guess you are running batch 32 inference. For batch 32 inference, YOLOv7 takes 2.8 ms average inference time and YOLOv5m takes 1.7 ms average inference time in the paper.

WongKinYiu avatar Jul 19 '22 22:07 WongKinYiu

I tried running batch size 1 inference. It increased the inference time for both, but yolov7 still did not run faster than yolov5m. Is my method of speed testing correct, i.e. running test.py for yolov7 and val.py for yolov5 with the --task speed flag?

NicholasZollo avatar Jul 20 '22 14:07 NicholasZollo

What inference times do you get on yolov7-tiny, yolov7, yolov5n, yolov5s, yolov5m, and yolov5l?

WongKinYiu avatar Jul 21 '22 00:07 WongKinYiu

I am experiencing the same issue. I used the settings below:

python test.py --data data/test_yolo.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device cpu --weights yolov7.pt --name yolov7_640_val

and the result:

Speed: 439.6/1.3/440.9 ms inference/NMS/total per 640x640 image at batch-size 1

For YOLOv5, when I run the command below on the same images, I get the following results:

python \yolov5\detect.py --source inference/images --device cpu
detect: weights=..\FFD\FFD_pipeline\yolov5\yolov5s.pt, source=inference/images, data=..\FFD\FFD_pipeline\yolov5\data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=cpu, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=..\FFD\FFD_pipeline\yolov5\runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
YOLOv5  2022-7-5 Python-3.8.13 torch-1.11.0+cpu CPU

Fusing layers...
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
image 1/2 D:\Code\yolov7\inference\images\horses.jpg: 448x640 5 horses, Done. (0.107s)
image 2/2 D:\Code\yolov7\inference\images\horses1.jpg: 448x640 5 horses, Done. (0.089s)
Speed: 0.0ms pre-process, 98.0ms inference, 1.5ms NMS per image at shape (1, 3, 640, 640)
Results saved to exp9

Any idea?

yousefis avatar Jul 21 '22 12:07 yousefis

CPU inference time is usually proportional to FLOPs.
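
If you want to sanity-check that proportionality on your own models, here is a minimal sketch (not part of either repo; it assumes the third-party thop package and uses a torchvision model purely as a stand-in for a detector):

```python
# Count MACs/parameters at the benchmark input size and compare the numbers against
# your measured CPU latency. `thop` and the resnet50 stand-in are assumptions for
# illustration, not anything used by yolov7 or yolov5.
import torch
import torchvision
from thop import profile  # pip install thop

model = torchvision.models.resnet50(weights=None).eval()
dummy = torch.randn(1, 3, 640, 640)            # same input size as the speed tests above
macs, params = profile(model, inputs=(dummy,))
print(f"{macs / 1e9:.2f} GMACs, {params / 1e6:.2f}M parameters")
```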

WongKinYiu avatar Jul 21 '22 12:07 WongKinYiu

Using the pretrained weights (except yolov7-tiny) on the COCO 2017 dataset with a Tesla T4 GPU:

yolov5: python val.py --data data/coco.yaml --weights [model] --batch-size 1 --imgsz 640 --task speed --device 0

yolov5n: mAP@.5 0.535, mAP@.5:.95 0.359, 0.2ms pre-process, 4.5ms inference, 0.7ms NMS
yolov5s: mAP@.5 0.616, mAP@.5:.95 0.439, 0.2ms pre-process, 4.7ms inference, 0.7ms NMS
yolov5m: mAP@.5 0.672, mAP@.5:.95 0.509, 0.2ms pre-process, 6.8ms inference, 0.7ms NMS
yolov5l: mAP@.5 0.701, mAP@.5:.95 0.546, 0.2ms pre-process, 10.6ms inference, 0.7ms NMS

yolov7: python test.py --data data/coco.yaml --weights [model] --batch-size 1 --img-size 640 --task speed --device 0

yolov7-tiny: mAP@.5 0.349, mAP@.5:.95 0.236, 5.0/0.7/5.6 ms inference/NMS/total (trained for only 36 epochs)
yolov7: mAP@.5 0.616, mAP@.5:.95 0.46, 11.9/0.7/12.6 ms inference/NMS/total

I did notice that the displayed mAP values are not consistent with the pycocotools-evaluated mAP (which does match the values claimed in the paper), so that part may not be important. However, the speed is coming out worse than claimed. There is some variation in the inference times, but it is minor.

NicholasZollo avatar Jul 21 '22 14:07 NicholasZollo

I cannot reproduce your results, but we have tested YOLOv7-tiny on both PyTorch and darknet and they showed consistent results. Maybe you could run an experiment on darknet to check whether your PyTorch performance on YOLOv7 is normal: darknet.exe detector demo cfg/coco.data cfg/yolov7-tiny.cfg yolov7-tiny.weights test.mp4 -benchmark

Your posted results are also really strange: the T4 GPU is slower than the V100, yet your T4 inference time is about 30% faster than the official u5 V100 inference time. Your T4 performance is also more than twice as fast as the officially reported u5 benchmark.

Other people have also helped us benchmark on TensorRT; YOLOv7-tiny runs about twice as fast as YOLOv5s.

WongKinYiu avatar Jul 22 '22 00:07 WongKinYiu

I used my laptop (the GPU is a GTX 1650) to run yolov7 and yolov5-l. At first it seemed that yolov7 (150 ms/image) was slower than yolov5-l (70 ms/image). But then I found this issue: when half=False is set, yolov7 becomes faster (60~70 ms/image), which is close to yolov5-l. In my opinion, some NVIDIA GPUs don't support half-precision inference well, so using half inference may hurt; on such devices you need to set half=False for faster inference. Besides, in terms of parameters and model size, yolov7 is smaller than yolov5-l, so yolov7 is more efficient.
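
A minimal sketch for checking this on your own hardware (my own illustration, not this repo's benchmark code; it assumes a CUDA device and uses a torchvision model as a stand-in for the detector):

```python
# Time the same model in FP32 and FP16 to see whether half-precision inference
# actually helps on a given GPU. Assumes a CUDA device; resnet50 is only a stand-in.
import time
import torch
import torchvision

def time_inference(model, dtype, runs=50, warmup=10):
    device = torch.device("cuda")
    model = model.to(device).to(dtype).eval()
    x = torch.randn(1, 3, 640, 640, device=device, dtype=dtype)
    with torch.no_grad():
        for _ in range(warmup):           # warm up kernels / cuDNN autotuning
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()          # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / runs * 1000  # ms per image

model = torchvision.models.resnet50(weights=None)
print(f"FP32: {time_inference(model, torch.float32):.1f} ms/image")
print(f"FP16: {time_inference(model, torch.float16):.1f} ms/image")
```

On a GPU with weak FP16 support, the second number can come out larger, which matches the behavior described above.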

polar99 avatar Jul 22 '22 08:07 polar99

I am also confused. The attached image compares inference speed between yolov7 and yolov5s6, and between yolov7-tiny and yolov5n6: the inference speed of yolov7 is 0.152 s versus 0.011 s for yolov5s6, and yolov7-tiny is 0.039 s versus 0.007 s for yolov5n6.

Please help me explain the reason for this result. Thanks.

zhengzhigang1979 avatar Jul 25 '22 04:07 zhengzhigang1979

I am also confused. I do not believe yolov7 is faster than yolov5.

JNH-LD avatar Jul 26 '22 02:07 JNH-LD

When tested in an identical environment on an NVIDIA T4 GPU:

YOLOv7 (51.2% AP, 12.7 ms) is 1.5x faster and +6.3% AP more accurate than YOLOv5s6 (44.9% AP, 18.7 ms):

https://colab.research.google.com/gist/AlexeyAB/56912451a33981d977ff9ea61025ae40/yolov7trtlinaom.ipynb#scrollTo=-tMYe8f27US9

!python test.py --data data/coco.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt --name yolov7_640_val
...
Speed: 12.6/0.9/13.5 ms inference/NMS/total per 640x640 image at batch-size 1
...
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.512
!python val.py --data data/coco.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5s6.pt --name yolov5s6_1280_val
...
Speed: 0.7ms pre-process, 18.7ms inference, 1.7ms NMS per image at shape (1, 3, 1280, 1280)
...
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449

YOLOv7 (51.2% AP, 12.6 ms) has almost the same accuracy but is 4x faster than YOLOv5m6 (51.3% AP, 49.1 ms):

https://colab.research.google.com/gist/AlexeyAB/857c4859a7a27abca8775245884d1ecf/yolov7trtlinaom.ipynb

!python test.py --data data/coco.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt --name yolov7_640_val
...
Speed: 12.6/0.9/13.5 ms inference/NMS/total per 640x640 image at batch-size 1
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.512
!python val.py --data data/coco.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5m6.pt --name yolov5m6_1280_val
...
Speed: 0.6ms pre-process, 49.1ms inference, 1.7ms NMS per image at shape (1, 3, 1280, 1280)
...
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.513

Moreover, YOLOv7-W6 1280x1280 (54.6% AP, 29 ms) has comparable accuracy but is 6.6x faster than YOLOv5x6 1280x1280 (55.0% AP, 192 ms).

AlexeyAB avatar Jul 27 '22 06:07 AlexeyAB

I tested YOLOv7 on an NVIDIA GeForce GTX 1080 Ti and an NVIDIA GeForce RTX 3070. On the 3070, YOLOv7 inference speed is approximately 50% lower than on the 1080 Ti. Consider this in speed tests.

mkhoshbin72 avatar Jul 27 '22 17:07 mkhoshbin72

We ran inference in OpenCV using the ONNX-converted models on a single image of size 640x640. All YOLOv7 versions seem to be slower than YOLOv4 and YOLOv5l. Any idea why this is the case?

Model Architecture FPS on TITAN RTX
yolov7.pt 56.27
yolov7x.pt 56.18
yolov7-w6.pt 29.76
yolov7-e6.pt 25.31
yolov7-d6.pt 22.3
yolov7-e6e.pt 19.87
yolov5l 64.11
yolov4 67.84

mohaghighat avatar Aug 01 '22 02:08 mohaghighat

It is strange that you get 56 FPS (18 ms) for yolov7.pt on a Titan RTX (130 TFLOPS-TC), while we get a higher 79 FPS (12.6 ms) on a T4 GPU (65 TFLOPS-TC), even though the Titan RTX is twice as powerful a GPU: https://colab.research.google.com/gist/AlexeyAB/857c4859a7a27abca8775245884d1ecf/yolov7trtlinaom.ipynb

YOLOv7 (51.2% AP, 12.6 ms) has almost the same accuracy but is 4x faster than YOLOv5m6 (51.3% AP, 49.1 ms).

There seems to be something wrong with the ONNX converter or the ONNX inference code.

Have you integrated NMS into the YOLOv7 ONNX model as shown in our readme file, and did you evaluate YOLOv5 without NMS?

What batch size, float precision, tensor cores, export code, inference code, number of test images, warmup, NMS, ... did you use?
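
For what it's worth, the NMS question matters because its cost can either be hidden inside an end-to-end ONNX export or paid separately in post-processing. A minimal illustration of timing NMS on its own (my own sketch with random boxes, not code from either repo; assumes a CUDA device and torchvision):

```python
# Time standalone NMS on hypothetical raw detections, so its cost can be reported
# separately from (or added to) the pure network forward time.
import time
import torch
from torchvision.ops import nms

boxes = torch.rand(1000, 4, device="cuda") * 640   # hypothetical raw boxes
boxes[:, 2:] += boxes[:, :2]                        # make them valid xyxy (x2 > x1, y2 > y1)
scores = torch.rand(1000, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
keep = nms(boxes, scores, iou_threshold=0.65)       # indices of boxes kept after NMS
torch.cuda.synchronize()
print(f"NMS: {(time.perf_counter() - t0) * 1000:.2f} ms, kept {keep.numel()} boxes")
```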

AlexeyAB avatar Aug 01 '22 02:08 AlexeyAB

@AlexeyAB We tried OpenCV inference but got the error mentioned in https://github.com/WongKinYiu/yolov7/issues/400#issue-1325396557. Also, when running inference with ONNX Runtime we got low FPS.

batch size = 1, float precision = 16, tensor cores = 576, export code = https://github.com/WongKinYiu/yolov7/blob/main/export.py, inference code = OpenCV's readNetFromONNX(); we measure the elapsed time of a single inference, repeat this for a set of ~500 images, and take the average (a sketch of this measurement appears below).

Because of the error mentioned in that issue, we omitted --grid from the export command given in the readme.
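
For reference, a minimal sketch of the measurement described above (reconstructed for illustration, not code from this repo; the model path and image folder are placeholders, and the commented CUDA lines require an OpenCV build with CUDA support):

```python
# Load the exported ONNX model with OpenCV's DNN module and average the
# forward-pass time over a set of images.
import glob
import time
import cv2

net = cv2.dnn.readNetFromONNX("yolov7.onnx")
# Optional: run on the GPU (needs OpenCV built with CUDA); the FP16 target uses tensor cores.
# net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
# net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

times = []
for path in glob.glob("inference/images/*.jpg"):
    img = cv2.imread(path)
    blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (640, 640), swapRB=True, crop=False)
    net.setInput(blob)
    t0 = time.perf_counter()
    net.forward()                        # raw network output; NMS would be timed separately
    times.append(time.perf_counter() - t0)

print(f"avg inference: {1000 * sum(times) / len(times):.1f} ms "
      f"({len(times) / sum(times):.1f} FPS)")
```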

Nuwan1654 avatar Aug 02 '22 14:08 Nuwan1654

I compared the speed and mAP of yolov7 and yolov5s6 on coco128 using an RTX 2060 (like the T4, it has tensor cores).

for yolov7: python test.py --data data/coco128.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt (results in the attached screenshot)

for yolov5s6: python val.py --data data/coco128.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5s6.pt (results in the attached screenshot)

The conclusion is that the mAP of yolov7 is better: yolov7 at a 640 input can exceed the mAP of yolov5s6 at a 1280 input. The paper therefore only compares inference time at 640 against 1280 and does not compare the two models at the same resolution. This may be one reason why the inference speed of yolov7 comes out slower than yolov5 in the comparisons above, which use the same resolution.

More importantly, yolov7 uses half-precision inference by default, while yolov5 does not. So in the experimental results above yolov7 appears to be faster than yolov5s6, but this is just an illusion.

for yolov5s6 with half precision: python val.py --data data/coco128.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5s6.pt --half (results in the attached screenshot)

So yolov5 still has excellent speed at a 1280 input, but it is undeniable that yolov7's mAP at 640 is also excellent.

As a supplement, here is yolov7's inference speed under FP32: change the half_precision parameter of the test function in test.py to False and run python test.py --data data/coco128.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt (results in the attached screenshot).

zhjw0927 avatar Sep 14 '22 14:09 zhjw0927

I tested YOLOv7 on an NVIDIA GeForce GTX 1080 Ti and an NVIDIA GeForce RTX 3070. On the 3070, YOLOv7 inference speed is approximately 50% lower than on the 1080 Ti. Consider this in speed tests.

Did you ever figure out a fix?

StefanCiobanu1989 avatar Dec 27 '22 04:12 StefanCiobanu1989