Question: D‑Fine training is much slower than YOLOv8/DAMO on 2×A100 (DDP)
Hi @ArgoHA! Thank you for this amazing work, I really enjoy this repo's style, and I'm getting strong accuracy from the D-Fine model. However, training is slow, so I ran a small comparison between YOLOv8, DAMO-YOLO, and D-Fine.
TL;DR
- Despite having fewer params and GFLOPs, D‑Fine (S) trains ~4–5× slower than YOLOv8m and ~1.8× slower than DAMO-YOLO in my setup
About my setup:
Dataset:
- ~6.5k images total.
- Train size: 3927
- Val size: 2656
- 26 classes
Training h-params:
- Batch: 64
- imgsz: 640
- epochs: 120
- amp: False
Some notes:
- All runs were executed sequentially (no competing jobs) to avoid CPU/I/O contention.
- All trainings were executed in a DDP setup using 2× A100 80 GB GPUs.
- AMP is disabled because training breaks with it; I saw the same issue reported in this repo and in the original D-Fine repo.
- For D-Fine, I implemented a custom DDP training script (a minimal sketch of the wrapping is shown after the table below).

Training time comparison, single GPU vs DDP:
| Setup | Epoch train time, sec | Epoch val time, sec | VRAM consumption (GB per GPU) |
|---|---|---|---|
| Single GPU (orig, as in repo) | 260 | 217 | 43.5 |
| Custom DDP (2 GPUs) | 187 | 145 | 23 |
- DDP speed‑up is ~30%
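For context, the kind of DDP wrapping I mean is the standard `torchrun` + `DistributedDataParallel` + `DistributedSampler` pattern. Below is a minimal, self-contained sketch, not my actual training script: the tiny conv model and random dataset are dummy stand-ins for D-Fine and my real data, and the per-GPU batch size of 32 just reflects the global batch of 64 split across 2 GPUs.

```python
# Minimal DDP wrapping sketch (dummy model/dataset, NOT the repo's real training script).
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset, DistributedSampler


class RandomDataset(Dataset):
    """Stand-in for the real detection dataset: random 640x640 images, random targets."""
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(3, 640, 640), torch.randint(0, 26, (1,)).float()


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy stand-in model; the real script builds D-Fine and its criterion here.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, 3, stride=2, padding=1),
        torch.nn.AdaptiveAvgPool2d(1),
        torch.nn.Flatten(),
        torch.nn.Linear(8, 1),
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = RandomDataset()
    sampler = DistributedSampler(dataset, shuffle=True)
    # Per-GPU batch size = global batch (64) / world size (2)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards across ranks each epoch
        for images, targets in loader:
            images = images.cuda(local_rank, non_blocking=True)
            targets = targets.cuda(local_rank, non_blocking=True)
            loss = loss_fn(model(images), targets)
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```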
Cross‑model comparison (same data & hparams)
| Model | Params, M | GFLOPs | VRAM consumption (GB per GPU) | Epoch train time, sec | Epoch val time, sec | Epoch total time, sec | Total training time, hours | mAP50-95 |
|---|---|---|---|---|---|---|---|---|
| YOLOv8m | 25.8 | 79.1 | ~16 | 45 | 10 | 55 | 2.0 | 0.79 |
| DAMO-YOLO (damoyolo_tinynasL45_L_508) | 43.43 | 100.44 | ~30 | 200 | 50 | 250 | 6.8 | 0.82 |
| D-Fine (S model) | 10 | 25 | ~23 | 200 | 150 | 350 | 11 | 0.84 |
* You may notice that the total hours for DAMO (250 s/epoch × 120 epochs ≈ 8.3 h) differ from the 6.8 h logged in my run. This is because the train/val/total epoch times shown in the table are averages over the first 3 epochs.
Question
- Is this slowdown expected due to specific architecture/loss/augmentation choices in D‑Fine’s training pipeline?
Thanks a lot for any pointers; I'm happy to run additional experiments if that helps.
Hey, thanks for your experiments. Did you happen to compare this repo with the original D-FINE? I wonder if they are similar in training speed. Overall I noticed that too, but wasn't able to find the cure, although there are some ideas.
So if the original D-FINE is close in speed, it's an architectural thing, because I took only the model architecture and loss functions from the original repo; everything else I rewrote from scratch. It's also interesting to know whether RT-DETR is faster or not.
Yes, I did compare this repo with the original D-Fine, but with the original implementation I ran into an issue: evaluation metrics stayed close to zero. That was a couple of months ago, and there was an open issue about it at the time. I can re-check now to see if it has been resolved.
Regarding RT-DETR, I can also run a comparison, though it might take me some time. Do you have any recommendations on which training framework to use: the original RT-DETR repo, or maybe another implementation you know of that might be more optimized?
I don't want to waste your time, but RT-DETR would be an interesting data point since it is also a transformer-based model, so it would make sense if it was also slow. I don't know other repos, only the original one.
With the original D-FINE I know about that issue, but maybe we can just check how long it takes to go through the first epoch to compare the speed.
I think it's also a good idea for me to profile training and see what can be improved. Not sure when I will get to it, though. I will let you know if I get any new info on this topic.
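For whoever gets to it first, something along these lines with `torch.profiler` should show whether the time goes into the forward/backward pass, the loss, or the dataloader. The conv model and random batches below are just dummies to keep the snippet self-contained and it assumes a CUDA GPU; swap in the real train step.

```python
# Sketch: profile a few training steps with torch.profiler to see where the time goes.
# Dummy model and random batches stand in for the real training loop.
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda"  # assumes a CUDA GPU, as in the A100 runs above
model = torch.nn.Conv2d(3, 16, 3, padding=1).to(device)   # dummy stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=2, warmup=2, active=6, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step in range(12):
        x = torch.randn(8, 3, 640, 640, device=device)     # dummy batch
        loss = model(x).mean()                              # dummy forward + loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()                                          # advance the profiler schedule

# Console summary; the full trace can be inspected in TensorBoard (profiler_logs/).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```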
Jfyi: RT-DETR and custom D-Fine training on multi-GPU (we added the multi-GPU support) have about the same speed per epoch.