
Why don't you keep image ratio?

ArgoHA opened this issue 11 months ago · 23 comments

Is there a reason you train D-FINE without keeping the image aspect ratio? You just use a resize that squashes the image into a square, but detectors usually use a letterbox transform like:

    # Letterbox: resize the longest side to the target size, then pad to the target shape.
    A.LongestMaxSize(max_size=max(self.target_h, self.target_w)),
    A.PadIfNeeded(
        min_height=self.target_h,
        min_width=self.target_w,
        border_mode=cv2.BORDER_CONSTANT,
        value=(114, 114, 114),
    ),

Is there a reason why you are not doing that and I should not add it to the training pipeline?

ArgoHA avatar Jan 09 '25 14:01 ArgoHA

I see that in the torch inference code you resize without keeping the ratio (as you do during training):

    import torchvision.transforms as T

    # Plain (640, 640) resize: the aspect ratio is not preserved, same as in training.
    transforms = T.Compose([
        T.Resize((640, 640)),
        T.ToTensor(),
    ])

But for ONNX inference the preprocessing "Resizes an image while maintaining aspect ratio and pads it". Is there a reason for that? I would assume you lose accuracy if you train on squashed images and then keep the ratio during ONNX inference.
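For reference, a minimal sketch of that kind of aspect-ratio-preserving preprocessing (my own function and variable names, not the repo's; it assumes a 3-channel BGR image):

    import cv2
    import numpy as np

    def letterbox(image, size=640, pad_value=114):
        """Resize so the longest side equals `size`, then pad to a square canvas."""
        h, w = image.shape[:2]
        scale = size / max(h, w)
        new_w, new_h = int(round(w * scale)), int(round(h * scale))
        resized = cv2.resize(image, (new_w, new_h))
        canvas = np.full((size, size, 3), pad_value, dtype=resized.dtype)
        canvas[:new_h, :new_w] = resized  # image in the top-left corner, grey padding elsewhere
        return canvas, scale  # scale is needed to map predicted boxes back to the original image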

Overall I really like your work and would like to contribute. What do you think about this aspect ratio issue?

I would do this: implement aspect ratio preservation as a flag for training and inference. During inference I would also cut the grey padding so we don't waste time computing on 114-valued pixels (it worked great for me before with several YOLO models); see the sketch below.
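A minimal sketch of the padding cut (illustrative names; it assumes the model accepts any input whose sides are a multiple of its stride, which is not true for every exported model):

    import math

    import cv2
    import numpy as np

    def letterbox_min_pad(image, max_size=640, stride=32, pad_value=114):
        """Resize keeping the aspect ratio, then pad only up to the next stride multiple."""
        h, w = image.shape[:2]
        scale = max_size / max(h, w)
        new_w, new_h = int(round(w * scale)), int(round(h * scale))
        resized = cv2.resize(image, (new_w, new_h))
        # Pad to a multiple of the stride instead of a full max_size x max_size square,
        # so less compute is spent on grey padding pixels.
        pad_h = math.ceil(new_h / stride) * stride
        pad_w = math.ceil(new_w / stride) * stride
        canvas = np.full((pad_h, pad_w, 3), pad_value, dtype=resized.dtype)
        canvas[:new_h, :new_w] = resized
        return canvas, scale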

ArgoHA avatar Jan 10 '25 16:01 ArgoHA

@Peterande Here are things I would like to work on:

  1. Image ratio flag for training and a letterbox with cut padding for inference.
  2. Mosaic augmentation during training (I see that you have code for it, but it is not being used).
  3. Unified configs (they are too spread out; changing the image size should happen in one place, I believe).
  4. More metrics (Precision, Recall, F1, TPs, FPs, FNs); see the sketch after this list.
  5. wandb integration.
  6. Unified inference classes for Torch / TensorRT / OpenVINO (preprocessing, inference, postprocessing: take a raw image, return processed results).
  7. Ability to pick the metric that the best model is saved on.
  8. Debug flag to save a subset of 1) preprocessed images with annotations (to know what you feed the model) and 2) images with model predictions during validation (to see what the model predicts during training).
  9. Maybe clean up the training output and add something like tqdm with an ETA for the whole training.
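For item 4, a minimal sketch of how the extra metrics could be computed, assuming per-class TP/FP/FN counts have already been matched at some IoU threshold (illustrative names, not existing repo code):

    def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        """Precision, recall and F1 from matched detection counts."""
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1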

Let me know if you guys are interested in any of these or have other ideas for contribution.

ArgoHA avatar Jan 11 '25 10:01 ArgoHA

We are excited about these ideas! They all seem super valuable and will take our project to the next level. We're looking forward to your contributions with great anticipation.

HebeiFast avatar Jan 11 '25 15:01 HebeiFast

@ArgoHA I also noticed the aspect ratio issue. Let me know if I can help.

lz1004 avatar Jan 13 '25 08:01 lz1004

@lz1004 I spent some time implementing this model in my custom training pipeline to better understand the details. I'll start implementing some features for this repo soon. I will let you know if I need some help.

For now the only issue I see is NaNs in predictions when I train with my custom pipeline with AMP turned on, but I think it should be solved with gradient clipping. I did not see this issue with this repo, so we should be good.

ArgoHA avatar Jan 25 '25 12:01 ArgoHA

@HebeiFast May I ask why you don't always use the specified input size during training? I was training with 960x960 and I see a lot of samples with a higher or lower image size. Is there a purpose to that?

@lz1004 do you have an idea?

ArgoHA avatar Feb 01 '25 19:02 ArgoHA

@ArgoHA in the standard configs, multiscale training is turned on until epoch 71. This regularization technique is used in many kinds of detector training. In YOLOX, for example, input sizes during training range between the original size - 3x32 and the original size + 3x32. It can lead to better generalization.
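For context, a minimal sketch of such a multiscale schedule (illustrative only; the exact sizes and schedule in D-FINE's configs may differ):

    import random

    def pick_train_size(base: int = 640, stride: int = 32, steps: int = 3) -> int:
        """Pick a random input size between base - steps*stride and base + steps*stride."""
        return base + stride * random.randint(-steps, steps)

    # Typically re-drawn once per batch; with base=960 the possible sizes are
    # 864, 896, 928, 960, 992, 1024 and 1056.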

harmluSICKAG avatar Feb 01 '25 19:02 harmluSICKAG

One more question: have you ever encountered a case where confidence scores get lower after a couple of epochs and the metrics drop? E.g. epoch 1 - mAP 0.5, epoch 3 - 0.2.

I am still experimenting with the model in a custom pipeline and can't understand what is wrong. The same code worked on a dataset with 13 classes; now I am testing on smaller objects with 2 classes and getting horrible results.

With your original pipeline I don't see any issues; just asking in case it is a known issue.

UPD: the answer is that it can happen if focal_loss is True but the matcher gets that flag as False :)

ArgoHA avatar Feb 01 '25 20:02 ArgoHA

UPD: the answer is that it can happen if focal_loss is True but the matcher gets that flag as False :)

@ArgoHA, what do you mean by that? Is it an error when initializing the config, and where in the code does it occur? I also get bad results with small objects; what helped you improve the results relative to the base config?

KozlovKY avatar Feb 11 '25 12:02 KozlovKY

@KozlovKY I didn't have any issues with this repo; I just took the model and loss function and put them into my custom training pipeline. I had a bug there: I was using focal loss (the default) but didn't pass that flag to the matcher. In the original repo there is no such issue. Did you check how ultralytics (for example) trains on the same dataset?
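To make the bug concrete, a toy sketch (class and field names are mine, not D-FINE's actual API): the classification cost used for matching and the classification loss should use the same focal-loss setting, otherwise the matcher and the loss score predictions differently.

    from dataclasses import dataclass

    @dataclass
    class Matcher:
        use_focal_loss: bool = False  # stays False if the flag is never passed through

    @dataclass
    class Criterion:
        matcher: Matcher
        use_focal_loss: bool = True

    buggy = Criterion(matcher=Matcher())                     # flags disagree -> unstable training
    fixed = Criterion(matcher=Matcher(use_focal_loss=True))  # one flag drives both components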

ArgoHA avatar Feb 11 '25 13:02 ArgoHA

Did you check how ultralytics (for example) trains on the same dataset?

@ArgoHA, yes, ultralytics converges faster and the quality is better. I have high-resolution images, maybe that is the crucial factor. I am thinking of increasing the number of queries as an experiment, but haven't had time yet.

KozlovKY avatar Feb 11 '25 13:02 KozlovKY

I tried 1280x1280 and got better results than ultralytics. Did you check that your dataset is converted correctly? I usually save debug images after preprocessing, with the labels drawn on them, to see what goes into the model.
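A minimal sketch of that kind of debug dump (illustrative helper; it assumes the preprocessed image is back in BGR uint8 and the boxes are pixel xyxy in the preprocessed coordinates):

    import cv2

    def save_debug_image(image_bgr, boxes_xyxy, path):
        """Draw ground-truth boxes on a preprocessed image and save it to disk."""
        img = image_bgr.copy()
        for x1, y1, x2, y2 in boxes_xyxy:
            cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.imwrite(str(path), img)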

ArgoHA avatar Feb 11 '25 14:02 ArgoHA

I tried 1280x1280 and got better results than ultralytics. Did you check that your dataset is converted correctly? I usually save debug images after preprocessing, with the labels drawn on them, to see what goes into the model.

I tried 1920x1920 and visualized my inputs; they are correct. After 14 epochs the metric is stuck. Maybe I need more time for convergence, but it is already a long time.

KozlovKY avatar Feb 11 '25 15:02 KozlovKY

@Peterande is there a reason why you chose to pretrain the model with a simple resize instead of keeping the image aspect ratio? I notice that D-FINE is a very strong model, but if I train it with preserved aspect ratios, I get worse results than with some YOLO models. I guess this happens because the pretrained weights are used to a simple resize. I hope there is a chance to retrain the models from scratch with aspect ratios preserved. Let me know if I can help with that (I only have a 3060, sorry, I can't help with resources). Meanwhile I will implement some of the features that I think make the most sense at the moment.

ArgoHA avatar Feb 12 '25 10:02 ArgoHA

@ArgoHA Do you have any news regarding the aspect ratio topic? I am unsure myself what is best for training. It seems wrong to use squishing, but there is probably a reason why @Peterande chose to squish and not to letterbox...

Would really love to hear about your experience @ArgoHA! :)

LemonGraZ avatar Sep 12 '25 09:09 LemonGraZ

My guess is simplicity.

lz1004 avatar Sep 12 '25 09:09 lz1004

@lz1004 Hm, okay, but one should keep it consistent, I guess, right? So when training with squish, also run inference with squish?

But swapping the training transform in the yaml for letterbox shouldn't be too difficult either...

LemonGraZ avatar Sep 12 '25 11:09 LemonGraZ

I tried training with letterbox on my custom data, starting from the pretrained COCO weights for D-FINE nano, and the results were not good.

lz1004 avatar Sep 12 '25 11:09 lz1004

Okay, so you @lz1004 recommend staying with squishing for both training and inference?

LemonGraZ avatar Sep 12 '25 11:09 LemonGraZ

Yes, in my experience with the repo, I would recommend that. All my projects on custom data did well with this combo and no parameter tuning (except epochs).

lz1004 avatar Sep 12 '25 11:09 lz1004

Perfect, thank you!

LemonGraZ avatar Sep 12 '25 11:09 LemonGraZ

@LemonGraZ I had a number of things I didn't like, so I actually recreated D-FINE from scratch in my own repo. Answering your question briefly:

  • Run inference the same way you fine-tuned.
  • Ideally fine-tune the same way the model was pretrained.
  • Squishing can be better in both accuracy and speed than keeping the ratio, if the ratio-preserving inference pads the image with grey pixels.

I wrote an article about object detection; you can check the "Letterbox or simple resize?" part, where I attach the results of my experiments with numbers.

ArgoHA avatar Sep 12 '25 11:09 ArgoHA

For a quick training run I would also suggest keeping the simple resize for both training and inference.

ArgoHA avatar Sep 12 '25 11:09 ArgoHA