
Question: D‑Fine training is much slower than YOLOv8/DAMO on 2×A100 (DDP)


Hi @ArgoHA! Thank you for this amazing work, I really enjoy the style of this repo, and I'm getting strong accuracy from the D-Fine model. However, training is slow for me, so I ran a small comparison of YOLOv8 / DAMO-YOLO / D-Fine.

TL;DR

  • Despite similar or better Params/GFLOPs, D‑Fine (s) trains ~4–5× slower than YOLOv8m and ~1.8× slower than DAMO-YOLO in my setup

About my setup:

Dataset:

  • ~6.5k images total.
  • Train size: 3927
  • Val size: 2656
  • 26 classes

Training hyperparameters:

  • Batch: 64
  • imgsz: 640
  • epochs: 120
  • amp: False

Some notes:

  • All runs were executed sequentially (no competing jobs) to avoid CPU/I/O contention.

  • All training runs were executed in a DDP setup on 2× A100 80 GB GPUs

  • AMP is disabled because training breaks with it enabled; I saw the same issue reported in this repo and in the original D-Fine repo (a possible bf16 workaround is sketched below, after the timing table)

  • For D-Fine, a custom DDP training script was implemented (a minimal sketch of the setup follows the table below)

Training time comparison, single GPU vs. DDP:

| Setup | Epoch train time, s | Epoch val time, s | VRAM per GPU, GB |
| --- | --- | --- | --- |
| Single GPU (original, as in repo) | 260 | 217 | 43.5 |
| Custom DDP (2 GPUs) | 187 | 145 | 23 |
  • DDP speed‑up is ~30%
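
For reference, the custom DDP script follows the usual torchrun pattern; the sketch below is only meant to show its rough shape, not the repo's actual code (build_model, build_dataset and the loss call are hypothetical placeholders):

```python
# Rough shape of a torchrun-launched DDP training script (a sketch, not the
# actual script). `build_model` / `build_dataset` are hypothetical placeholders.
# Launch with: torchrun --nproc_per_node=2 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(build_model().cuda(local_rank), device_ids=[local_rank])

    dataset = build_dataset("train")
    sampler = DistributedSampler(dataset, shuffle=True)
    # global batch 64 -> 32 per GPU on 2 GPUs
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=8, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(120):
        sampler.set_epoch(epoch)  # different shuffling per epoch across ranks
        for images, targets in loader:
            loss = model(images.cuda(local_rank), targets)  # assumes the model returns a scalar loss
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```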
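
On the AMP note: one thing I have not tried in these runs, but that may be worth checking on A100, is bf16 autocast instead of fp16, since bf16 keeps fp32's dynamic range and often avoids the divergence that breaks fp16 AMP. A minimal sketch, with `model`, `loader`, `optimizer` as in the training loop above:

```python
# Possible bf16 workaround for fp16 AMP instability (a sketch, not what was
# used in the runs above). `model`, `loader`, `optimizer` as in the DDP loop.
import torch

use_bf16 = torch.cuda.is_bf16_supported()  # True on A100

for images, targets in loader:
    # bf16 has the same exponent range as fp32, so no GradScaler is needed
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=use_bf16):
        loss = model(images.cuda(), targets)  # assumes the model returns a scalar loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```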

Cross‑model comparison (same data & hparams)

| Model | Params, M | GFLOPs | VRAM per GPU, GB | Epoch train time, s | Epoch val time, s | Epoch total time, s | Total training time, h | mAP50-95 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv8m | 25.8 | 79.1 | ~16 | 45 | 10 | 55 | 2.0 | 0.79 |
| DAMO-YOLO (damoyolo_tinynasL45_L_508) | 43.43 | 100.44 | ~30 | 200 | 50 | 250 | 6.8 | 0.82 |
| D-Fine (s) | 10 | 25 | ~23 | 200 | 150 | 350 | 11 | 0.84 |

* You may notice that the total hours for DAMO-YOLO (250 s/epoch × 120 epochs ≈ 8.3 h) differ from the 6.8 h logged in my run. This is because the train/val/total epoch times shown in the table are averages over the first 3 epochs.

Question

  • Is this slowdown expected due to specific architecture/loss/augmentation choices in D‑Fine’s training pipeline?

Thanks a lot for any pointers; I'm happy to run additional experiments if that helps.

proevgenii · Aug 12 '25 21:08

Hey, thanks for your experiments. Did you happen to compare this repo with the original D-FINE? I wonder if they are similar in training speed. Overall I noticed that too, but wasn't able to find the cure, although there are some ideas.

So if the original D-FINE is close in speed, it's an architectural thing, because I took only the model architecture and loss functions from the original repo; everything else I rewrote from scratch. It would also be interesting to know whether RT-DETR is faster or not.

ArgoHA · Aug 13 '25 08:08

Yes, I did compare this repo with the original D-Fine, but with the original implementation I ran into an issue: evaluation metrics stayed close to zero. That was a couple of months ago, and there was an open issue about it at the time. I can re-check now to see if it has been resolved.

Regarding RT-DETR, I can also run a comparison, though it might take me some time. Do you have any recommendations on which training framework to use: the original RT-DETR repo, or maybe you know another implementation that might be more optimized?

proevgenii · Aug 13 '25 15:08

I don't want to waste your time, but RT-DETR would be an interesting data point, as it is also a transformer-based model, so it would make sense if it were also slow. I don't know other repos, only the original one.

With the original D-FINE I know about that issue, but maybe we can just check how long it takes to go through the first epoch, just to compare the speed.
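
Something like this would be enough for a rough first-epoch comparison (a sketch; `train_one_epoch` stands in for whatever loop each repo exposes):

```python
# Wall-clock the first epoch only, ignoring metrics (a sketch; `train_one_epoch`
# is a hypothetical placeholder for each repo's own training loop).
import time

import torch

start = time.perf_counter()
train_one_epoch(model, loader, optimizer)  # hypothetical stand-in
torch.cuda.synchronize()                   # wait for queued GPU work before stopping the clock
print(f"first epoch took {time.perf_counter() - start:.1f} s")
```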

I think it's also a good idea for me to profile training and see what can be improved. Not sure when I will get to it, though. I will let you know if I get any new info on this topic.
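
For the profiling, something along these lines should show whether the time goes to data loading, the forward/backward pass, or the loss/matching step (a sketch with torch.profiler; `model`, `loader`, `optimizer` come from the existing training setup):

```python
# Profile a handful of training steps with torch.profiler (a sketch; `model`,
# `loader`, `optimizer` are assumed to come from the existing training code).
import torch
from torch.profiler import ProfilerActivity, profile, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=5, warmup=5, active=10, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
) as prof:
    for step, (images, targets) in enumerate(loader):
        loss = model(images.cuda(), targets)  # assumes the model returns a scalar loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()  # advance the wait/warmup/active schedule
        if step >= 20:  # 5 wait + 5 warmup + 10 active steps
            break

# aggregated table; the full trace can also be opened in TensorBoard
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```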

ArgoHA · Aug 13 '25 16:08

JFYI: RT-DETR and custom D-Fine training on multi-GPU (we added the multi-GPU support) have about the same speed per epoch.

saumitrabg · Aug 29 '25 03:08