
onnx model optimized for CPU

Open senstar-hsoleimani opened this issue 5 months ago • 12 comments

Search before asking

  • [x] I have searched the RF-DETR issues and found no similar feature requests.

Description

I tested the RF-DETR Nano model (in ONNX format, both quantized and non-quantized) at a resolution of 320×320 on CPU. It takes around 180 ms per inference. In comparison, YOLOv8 Nano at 640×640 resolution runs in under 100 ms per inference on the same setup. I am running the ONNX model with the onnxruntime library in C++.
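
For reference, here is roughly how I create the session (a minimal Python equivalent of my C++ setup; the thread count and model path are just what I use in my test):

```python
import onnxruntime as ort

# Session tuned for CPU inference; thread count is specific to my machine.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4  # match physical cores on the test box

session = ort.InferenceSession(
    "rf-detr-nano-320.onnx",  # placeholder path to the exported model
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```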

Is there any way to further optimize RF-DETR Nano to improve its inference speed on CPU?

Use case

No response

Additional

No response

Are you willing to submit a PR?

  • [x] Yes I'd like to help by submitting a PR!

senstar-hsoleimani avatar Jul 26 '25 13:07 senstar-hsoleimani

Does that include YOLO NMS?

isaacrob-roboflow avatar Jul 26 '25 16:07 isaacrob-roboflow

No, this does not include YOLO NMS. With NMS, it takes around 105 to 110 ms.

senstar-hsoleimani avatar Jul 26 '25 18:07 senstar-hsoleimani

Cool. What's the confidence threshold? NMS is faster with a higher confidence threshold but results in lower mAP. The academic standard is to 'tune' confidence to get a better latency/mAP trade-off, usually ending up at 0.01.

Also, can you verify whether you're running at 32-bit or 16-bit precision? Ultralytics exports its checkpoints in 16-bit; I'm not sure whether that means your model is running at that precision.
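
One quick way to check, if you have the exported file handy (a minimal sketch; the path is a placeholder), is to look at the dtypes of the weight initializers:

```python
import onnx
from onnx import TensorProto

model = onnx.load("model.onnx")  # placeholder path
dtypes = {TensorProto.DataType.Name(init.data_type) for init in model.graph.initializer}
print(dtypes)  # {'FLOAT'} means fp32 weights, {'FLOAT16'} means fp16
```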

isaacrob-roboflow avatar Jul 26 '25 23:07 isaacrob-roboflow

I set the confidence to a small value; I do not want to miss any objects.

The YOLOv8 Nano I am using has not been quantized (it is FP32), but on CPU it is still faster than RF-DETR (both the quantized and non-quantized versions).

My question is about optimizing RF-DETR itself to run faster on CPU (just the model's forward pass, not pre- or post-processing).

senstar-hsoleimani avatar Jul 26 '25 23:07 senstar-hsoleimani

I'm also interested in benchmarking the Nano model but I can't get the inference package to build on my Mac. Can anyone share the ONNX weights with me?

jonashaag avatar Jul 27 '25 08:07 jonashaag

@senstar-hsoleimani I put some goodies in this issue: https://github.com/roboflow/rf-detr/issues/289

I think YOLOv8 is one of the last models that is purely quantizable out of the box, although I'm still exploring that.

Transformer models do not generally play nice with CPU.

rlewkowicz avatar Jul 27 '25 15:07 rlewkowicz

I buy that the transformers won't play as well with CPU as the CNNs do. For this release we focused purely on runtime on NVIDIA GPUs, even going so far as to benchmark with and without CUDA graphs in TensorRT. (Fun fact: CUDA graphs were not considered in D-FINE's benchmark, but when you DO use them, D-FINE is not faster than LW-DETR!)

I will say though that we plan to target more devices in future releases.

As for quantization, we found that YOLO11 decays slightly with fp16 while the transformer models don't. (We have a slightly different implementation of RF-DETR on our platform that we know doesn't decay; I'm not certain about the open-source one.) In the plot we include the fp16 mAP. Our reported mAP for YOLO is 2-4 mAP worse than the official numbers because of decay due to fp16, tuning NMS to be faster, and the fact that ultralytics uses a slower but more accurate NMS variant during validation than during prediction. Those subtleties are important to understand when claiming that it is meaningfully faster than other models on CPU: its performance may be decaying in unexpected ways, and I would encourage you to benchmark it with pycocotools or supervision.
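
For what it's worth, a pycocotools evaluation is only a few lines (a sketch; both paths are placeholders, and the detections need to be dumped in COCO results format first):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val.json")          # placeholder: ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")  # placeholder: model predictions

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints mAP@[.50:.95] and friends
```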

Finally, we did no experiments with int8. I'm now curious whether we can get it working; it sounds like that would be a benefit to the community.
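
If anyone wants to take a first pass, onnxruntime's dynamic quantization is the cheapest int8 experiment since it needs no calibration data (a sketch; paths are placeholders, and no promises the transformer blocks quantize cleanly):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="rf-detr-nano.onnx",        # placeholder: exported fp32 model
    model_output="rf-detr-nano-int8.onnx",  # placeholder: quantized output
    weight_type=QuantType.QInt8,            # int8 weights, dynamic activations
)
```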

isaacrob-roboflow avatar Jul 27 '25 17:07 isaacrob-roboflow

I could take another stab at QAT. I think I tried using their FX graph stuff; I might have more luck doing eager mode.

It did convert successfully to a fully end-to-end INT8 flatbuffer, but all the detections were off by about 20 pixels down and to the right. Maybe an issue with the zero-point, but I don't know how to describe it; I only understand these concepts at a surface level.

QAT might actually fix that.
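
Roughly the eager-mode flow I'd try, sketched on a toy module (the real model, layer fusion, and training loop are all placeholders here):

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class ToyNet(nn.Module):
    # stand-in for the real detector
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # int8 entry point
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()  # back to float at the output

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = ToyNet().train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")  # x86 backend
tq.prepare_qat(model, inplace=True)

_ = model(torch.randn(1, 3, 32, 32))  # stands in for fine-tuning steps

model.eval()
int8_model = tq.convert(model)  # swaps in real int8 modules
```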

After how much trouble I had with D-FINE, I decided not to play with it further.

Then I switched over to YOLOv12, but I had issues with the attention layers.

All of this is just with those NPUs, though. I don't know anything about OpenVINO, and I don't know anything about Qualcomm, etc.

And I'm fairly certain those are still mixed precision; I think they still fall back to float, which I can't do.

So for pure INT8, it was just a no-go. In the issue I linked, there are a bunch of goodies for getting a semi-functional INT8 model.

rlewkowicz avatar Jul 27 '25 18:07 rlewkowicz

How was speed with INT8? Is it worth it in the first place, compared to YOLOv8/11?

jonashaag avatar Jul 27 '25 18:07 jonashaag

I was struggling to get it functional, let alone run benchmarks.

I'm working on my own product; I took a break for a short contract and am back, and figured I'd explore new models.

IMO, the marginal improvements in mAP probably won't really matter, so it comes down to whichever toolchain fits your license and is easy to use. But I'm really just an advanced hobbyist in this space (unless I can actually produce a functional product).

Many of these models, when not trained against 80-90 classes, can handle domain-specific targets incredibly well. Most of it comes down to how you build your datasets.

Most benchmarks will put YOLOv6/v8 at the lowest latency.

rlewkowicz avatar Jul 27 '25 20:07 rlewkowicz

We'll be adding YOLOv8 latencies to our benchmark soon; it's likely a little faster than YOLO11.

Our new benchmark, RF100-VL, is designed to test how well object detectors transfer to real-world datasets. You can read more here. We also benchmark RF-DETR, LW-DETR, D-FINE, and YOLO11 in the charts in this repo. You can see that DETR-based models significantly outperform YOLO-based models on real-world datasets. We hypothesize that this is because DETR-based models are more amenable to pretraining, which lets them transfer better to smaller datasets.

Also note that the mAP reported for the YOLO models on RF100-VL doesn't take into account the decay factors listed above, so in a real deployment it is likely worse, similar to how we observe lower COCO scores.

isaacrob-roboflow avatar Jul 27 '25 20:07 isaacrob-roboflow

Has anyone tried quantizing the model to UINT8/UINT16 and running it on a Qualcomm NPU? I tried doing it when RF-DETR was first released, but creating an ONNX session with the QNN backend on-device just didn't work: lots of unsupported-node errors, and the session initialization process just freezes. I tried excluding unsupported nodes from quantization and running them on CPU instead, but had no success either.

Curious if anything has changed.
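
For context, this is roughly how I was creating the session (a sketch; the model path and QNN backend library are placeholders for my test device):

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "rf-detr-quant.onnx",  # placeholder: quantized model
    providers=[
        ("QNNExecutionProvider", {"backend_path": "libQnnHtp.so"}),  # HTP backend
        "CPUExecutionProvider",  # fallback for nodes QNN can't place
    ],
)
```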

danylomm avatar Nov 15 '25 19:11 danylomm