Question about input normalization
Congratulations on the great work showcased with DFormer! I have a few questions related to fine-tuning your models on a new dataset. What kind of normalization should I apply to my RGB images and depth maps before feeding them into the DFormer encoder? For RGB I assume a standard 0-1 normalization per image (i.e. dividing by 255), but for depth maps I don't see an obvious strategy. Moreover, I would probably obtain very bad results if your model was trained with a different normalization strategy, so I would appreciate any clarification. Thanks a lot! Best regards,
Lorenzo Mazza
Thanks for your attention to our work!
If you are fine-tuning our trained weights on a new dataset, we recommend keeping the normalization consistent with our pretraining (code):
transforms.Normalize(mean=torch.tensor(mean), std=torch.tensor(std))
Here the mean and std are IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406, 0.48) and IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225, 0.28). The first three constants are for RGB, while the last one is for depth.
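For concreteness, here is a minimal sketch of applying these statistics with torchvision. The tensor shapes are placeholders, and I assume both RGB and depth have already been scaled to [0, 1] (for depth, e.g., by dividing by its maximum value); please check the dataset loaders in our repo for the exact scaling used there.

import torch
import torchvision.transforms as transforms

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406, 0.48)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225, 0.28)

# Split the four-channel statistics: the first three are for RGB, the last one for depth.
rgb_norm = transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN[:3], std=IMAGENET_DEFAULT_STD[:3])
depth_norm = transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN[3:], std=IMAGENET_DEFAULT_STD[3:])

# Placeholder inputs: RGB as a (3, H, W) tensor and depth as a (1, H, W) tensor, both in [0, 1].
rgb = torch.rand(3, 480, 640)
depth = torch.rand(1, 480, 640)

rgb = rgb_norm(rgb)        # per-channel (x - mean) / std
depth = depth_norm(depth)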
I hope this helps. If it does not work, feel free to contact us for further discussion.
Best regards, Bowen Yin
Hi Yin, thank you so much for the prompt answer and the clarification! Your answer was super helpful. I will follow up if I have additional questions. Best,
Lorenzo