Performance comparison against YOLO11
Search before asking
- [x] I have searched the RF-DETR issues and found no similar bug report.
Bug
I am testing the Nano model's inference time against YOLO11n. The input images are 384x384 pixels. I tested on my Intel 12th-gen Core i7 iGPU and on an RTX 3050, using the fix_nontype_in_optimize_call branch.
I am getting inference times between 25-45 ms on the RTX and between 35-60 ms on the iGPU.
Is this an expected inference time? After looking at the comparison charts, I was expecting far better than that.
Environment
- Windows 11
- RF-DETR: fix_nontype_in_optimize_call branch
- Python 3.13
- Torch 2.8.0+cu129
- Core i7 12700H
- GeForce RTX 3050
Minimal Reproducible Example
```python
#!/usr/bin/env python3
import os
import time

import cv2
import torch
from rfdetr import RFDETRNano

# ---------------- CONFIG ----------------
CONF_THRESHOLD = 0.7
DIRECTORY = "datasetcropped/valid/"
# ----------------------------------------

model = RFDETRNano(resolution=384, device='cuda')
model.optimize_for_inference(dtype=torch.float16)  # 25-45 ms
# model.optimize_for_inference()  # cuda: 28-50 ms

for fname in os.listdir(DIRECTORY):
    if not fname.lower().endswith((".jpg", ".png", ".jpeg")):
        continue
    path = os.path.join(DIRECTORY, fname)
    img = cv2.imread(path)
    start_time = time.time()
    predictions = model.predict(img, confidence=CONF_THRESHOLD)
    end_time = time.time()
    print(f"YOLO inference took {end_time - start_time:.3f} seconds")
```
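Note that the first measured call in the log below (~2.3 s) includes lazy initialization (tracing, kernel compilation, allocation), so timing individual `predict` calls inflates the numbers. A fairer measurement discards a few warm-up iterations and reports median/p95. A minimal, framework-agnostic sketch, where `run_inference` is a placeholder for a call like `model.predict(img, confidence=CONF_THRESHOLD)`:

```python
import statistics
import time

def benchmark(run_inference, warmup=5, iters=50):
    """Time a callable, discarding warm-up iterations where one-time
    initialization dominates. Returns (median_ms, p95_ms)."""
    for _ in range(warmup):
        run_inference()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_inference()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    times_ms.sort()
    median = statistics.median(times_ms)
    p95 = times_ms[int(0.95 * (len(times_ms) - 1))]
    return median, p95

# dummy CPU workload standing in for model.predict
med, p95 = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median {med:.2f} ms, p95 {p95:.2f} ms")
```

On CUDA, kernel launches are asynchronous, so call `torch.cuda.synchronize()` before reading the clock or the timings only measure launch overhead.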
Additional
(rfdter) c:\tools\train\rfdter>python inferencenative.py
Using a different number of positional encodings than DINOv2, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Using patch size 16 instead of 14, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Loading pretrain weights
loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! (repeated 9x)
TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results). (repeated 3x)
TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect. (repeated 2x)
UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\pytorch\aten\src\ATen\native\TensorShape.cpp:4324.)
YOLO inference took 2.269 seconds
YOLO inference took 0.064 seconds
YOLO inference took 0.057 seconds
YOLO inference took 0.057 seconds
YOLO inference took 0.062 seconds
YOLO inference took 0.030 seconds
YOLO inference took 0.031 seconds
YOLO inference took 0.026 seconds
YOLO inference took 0.057 seconds
YOLO inference took 0.033 seconds
YOLO inference took 0.050 seconds
YOLO inference took 0.055 seconds
YOLO inference took 0.048 seconds
YOLO inference took 0.025 seconds
YOLO inference took 0.027 seconds
YOLO inference took 0.027 seconds
YOLO inference took 0.056 seconds
YOLO inference took 0.050 seconds
YOLO inference took 0.036 seconds
YOLO inference took 0.057 seconds
YOLO inference took 0.037 seconds
YOLO inference took 0.026 seconds
YOLO inference took 0.033 seconds
YOLO inference took 0.028 seconds
YOLO inference took 0.025 seconds
YOLO inference took 0.035 seconds
YOLO inference took 0.032 seconds
YOLO inference took 0.030 seconds
YOLO inference took 0.026 seconds
YOLO inference took 0.025 seconds
YOLO inference took 0.025 seconds
YOLO inference took 0.034 seconds
YOLO inference took 0.024 seconds
YOLO inference took 0.040 seconds
YOLO inference took 0.025 seconds
Are you willing to submit a PR?
- [x] Yes, I'd like to help by submitting a PR!
The results in the table are measured using TensorRT on a T4 via our open-source, reproducible benchmarking repo: https://github.com/roboflow/single_artifact_benchmarking
I don't know what latencies should be on a different GPU without TRT. I would encourage you to try the benchmarking tool! Though I'm realizing I'm not sure I remembered to upload official ONNX graphs for RF-DETR-Seg (Preview).
With your benchmark tool on my NVIDIA RTX 3050 I can achieve results similar to the ones you publish on the website. My concern is the iGPU:
- YOLO11n at 320x320 achieves 10 ms of inference time once deployed with Frigate (15 ms on the RTX with your benchmark).
- RF-DETR Nano FP16 achieves 6 ms with the benchmark tool. But once deployed to Frigate (and I also used OpenVINO's benchmark_app) I get 80 ms, which is a lot more. I also tried simplifying the graph without much progress.
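To isolate where the iGPU time goes, OpenVINO's `benchmark_app` can be pointed at the exported graph directly, comparing hints and precisions on the same device. A sketch of the invocations I'd try (the model filename is a placeholder for whatever your export produces, and exact flag availability depends on your OpenVINO version):

```shell
# Latency-oriented run on the Intel iGPU
benchmark_app -m rfdetr_nano.onnx -d GPU -hint latency -niter 200

# Same model, requesting FP16 inference precision for comparison
benchmark_app -m rfdetr_nano.onnx -d GPU -hint latency -infer_precision f16 -niter 200
```

If the latency-hint number already matches the ~80 ms seen in Frigate, the bottleneck is in the OpenVINO execution of the graph itself rather than in Frigate's pipeline.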
Cool. We haven't optimized at all for Frigate; I'm not sure what that would involve.
There is nothing Frigate-specific here; it's more a matter of OpenVINO export/tuning for performance.
Got it. We use attention pretty heavily, so if OpenVINO doesn't have an optimized attention implementation such as flash attention, deployment will suffer.
@isaacrob-roboflow -
I'd like to add my vote for OpenVINO here as well. With Intel chips being among the most widely deployed end-user hardware, and with OpenVINO dramatically boosting performance on top of the generic ONNX format, I'd highly encourage you to look deeply at this front.
I would be very interested in seeing a performance benchmark against the plain ONNX model.
That said, I was blown away by the model's accuracy. Given all the talk about speed and how it compares to YOLO (and that it removes post-processing latency, etc.), I was eager to try it on an iGPU. Like the original author, after giving it a try I was very disappointed, and it was a major setback. I truly hope this project succeeds and that you revisit the exclusive focus on the T4, particularly given all the talk about "edge" processing.
I'd also mention that we compared on an NVIDIA RTX 5000 GPU, and YOLO was 30% faster.
Thank you for listening!