Question: D‑Fine training is much slower than YOLOv8/DAMO on 2×A100 (DDP)
Hi @ArgoHA! Thank you for this amazing work, I really enjoy this repo's style, and I'm getting strong accuracy from the D-Fine model. However, training is slow, so I ran a small comparison between YOLOv8, DAMO-YOLO, and D-Fine.
TL;DR
- Despite having fewer params and GFLOPs, D‑Fine (S) trains ~4–5× slower than YOLOv8m and ~1.8× slower than DAMO-YOLO in my setup
About my setup:
Dataset:
- ~6.5k images total.
- Train size: 3927
- Val size: 2656
- 26 classes
Training h-params:
- Batch: 64
- imgsz: 640
- epochs: 120
- amp: False
Some notes:
- All runs were executed sequentially (no competing jobs) to avoid CPU/I/O contention.
- All trainings were executed in a DDP setup using 2× A100 80 GB GPUs.
- AMP is disabled because training breaks with it; I saw the same issue reported in this repo and in the original D-Fine repo.
- For D-Fine, I implemented a custom DDP training script (a minimal sketch of the wrapping is shown after the table below).

Training time comparison, single GPU vs DDP:
| Setup | Epoch train time, sec | Epoch val time, sec | VRAM consumption (GB per GPU) |
|---|---|---|---|
| Single GPU (orig, as in repo) | 260 | 217 | 43.5 |
| Custom DDP (2 GPUs) | 187 | 145 | 23 |
- DDP speed‑up is ~30%
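For context, the kind of DDP wrapping I mean is the standard `torchrun` + `DistributedDataParallel` + `DistributedSampler` pattern. Below is a minimal, self-contained sketch, not my actual training script: the tiny conv model and random dataset are dummy stand-ins for D-Fine and my real data, and the per-GPU batch size of 32 just reflects the global batch of 64 split across 2 GPUs.

```python
# Minimal DDP wrapping sketch (dummy model/dataset, NOT the repo's real training script).
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset, DistributedSampler


class RandomDataset(Dataset):
    """Stand-in for the real detection dataset: random 640x640 images, random targets."""
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(3, 640, 640), torch.randint(0, 26, (1,)).float()


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy stand-in model; the real script builds D-Fine and its criterion here.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, 3, stride=2, padding=1),
        torch.nn.AdaptiveAvgPool2d(1),
        torch.nn.Flatten(),
        torch.nn.Linear(8, 1),
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = RandomDataset()
    sampler = DistributedSampler(dataset, shuffle=True)
    # Per-GPU batch size = global batch (64) / world size (2)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards across ranks each epoch
        for images, targets in loader:
            images = images.cuda(local_rank, non_blocking=True)
            targets = targets.cuda(local_rank, non_blocking=True)
            loss = loss_fn(model(images), targets)
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```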
Cross‑model comparison (same data & hparams)
| Model | Params, M | GFLOPs | VRAM consumption (GB per GPU) | Epoch train time, sec | Epoch val time, sec | Epoch total time, sec | Total training time, hours | mAP50-95 |
|---|---|---|---|---|---|---|---|---|
| YOLOv8m | 25.8 | 79.1 | ~16 | 45 | 10 | 55 | 2.0 | 0.79 |
| DAMO-YOLO (damoyolo_tinynasL45_L_508) | 43.43 | 100.44 | ~30 | 200 | 50 | 250 | 6.8 | 0.82 |
| D-Fine (S model) | 10 | 25 | ~23 | 200 | 150 | 350 | 11 | 0.84 |
* You may notice that the total hours for DAMO (250 s/epoch × 120 epochs ≈ 8.3 h) differ from the 6.8 h logged in my run. This is because the train/val/total epoch times shown in the table are averages over the first 3 epochs.
Question
- Is this slowdown expected due to specific architecture/loss/augmentation choices in D‑Fine’s training pipeline?
Thanks a lot for any pointers; I'm happy to run additional experiments if that helps.
Hey, thanks for your experiments. Did you happen to compare this repo with the original D-FINE? I wonder if they are similar in training speed. Overall I noticed that too, but wasn't able to find the cure, although there are some ideas.
So if the original D-FINE is close in speed, it's an architectural thing, because I took only the model architecture and loss functions from the original repo; everything else I rewrote from scratch. It's also interesting to know whether RT-DETR is faster or not.
Yes, I did compare this repo with the original D-Fine, but with the original implementation I ran into an issue: evaluation metrics stayed close to zero. That was a couple of months ago, and there was an open issue about it at the time. I can re-check now to see if it has been resolved.
Regarding RT-DETR, I can also run a comparison, though it might take me some time. Do you have any recommendations on which training framework to use: the original RT-DETR repo, or maybe another implementation you know of that might be more optimized?
I don't want to waste your time, but RT-DETR would be an interesting data point since it is also a transformer-based model, so it would make sense if it was also slow. I don't know other repos, only the original one.
With the original D-FINE I know about that issue, but maybe we can just check how long it takes to go through the first epoch to compare the speed.
I think it's also a good idea for me to profile training and see what can be improved. Not sure when I will get to it, though. I will let you know if I get any new info on this topic.
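For whoever gets to it first, something along these lines with `torch.profiler` should show whether the time goes into the forward/backward pass, the loss, or the dataloader. The conv model and random batches below are just dummies to keep the snippet self-contained and it assumes a CUDA GPU; swap in the real train step.

```python
# Sketch: profile a few training steps with torch.profiler to see where the time goes.
# Dummy model and random batches stand in for the real training loop.
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda"  # assumes a CUDA GPU, as in the A100 runs above
model = torch.nn.Conv2d(3, 16, 3, padding=1).to(device)   # dummy stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=2, warmup=2, active=6, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step in range(12):
        x = torch.randn(8, 3, 640, 640, device=device)     # dummy batch
        loss = model(x).mean()                              # dummy forward + loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()                                          # advance the profiler schedule

# Console summary; the full trace can be inspected in TensorBoard (profiler_logs/).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```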
Jfyi: RT-DETR and custom D-Fine training on multi-GPU (we added the multi-GPU support) have about the same speed per epoch.