Co-DETR Questions on img_size, img_scale in co_dino_5scale_vit_large

I have two questions related to image size when training Co-DETR with ViT-L:

1. Could you explain why the backbone's img_size argument is set to 1536 (link)? The maximum img_scale is (1536, 2400), and I’m not entirely clear on how the two are connected. Since resizing produces image sizes ranging from 480 to 1536, I would appreciate more details on how this works.
2. A few days ago, you updated the img_scale values in the train_pipeline (link). Should the img_scale values in the test_pipeline be updated as well?
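
To make the img_scale question concrete, here is a minimal sketch of the keep-ratio rescale rule as I understand it from mmcv: the larger value of the pair bounds the long edge and the smaller value bounds the short edge. The function name rescale_keep_ratio and the 1920x1024 example image are hypothetical, chosen only for illustration, and I am assuming the pipeline resizes with keep_ratio=True.

```python
def rescale_keep_ratio(width, height, scale):
    """Return (new_width, new_height) for a keep-ratio resize.

    `scale` is a pair such as an img_scale entry: its larger value caps the
    long edge and its smaller value caps the short edge (assumed rule).
    """
    max_long, max_short = max(scale), min(scale)
    # Pick the largest factor that violates neither cap.
    factor = min(max_long / max(width, height), max_short / min(width, height))
    # Round to the nearest integer pixel size.
    return int(width * factor + 0.5), int(height * factor + 0.5)


# Hypothetical 1920x1024 (width x height) image at two img_scale entries.
largest = rescale_keep_ratio(1920, 1024, (1536, 2400))
smallest = rescale_keep_ratio(1920, 1024, (480, 2400))
print(largest, smallest)
```

Under this rule the output size depends on the aspect ratio of each image, so a single img_scale entry does not correspond to one fixed input size for the backbone.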

Thanks!

Nov 08 '24 08:11 jih0-kim

For ViT with LSJ augmentation, the backbone's img_size argument should be equal to the actual image size. For ViT with DETR augmentation, this argument can be ignored.
In my experiments, a test image size of 2048x1280 achieves the best single-scale performance.
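
As a rough illustration of why img_size only matters when the input size is fixed, the sketch below compares the patch grid implied by img_size with the grid of an actual input. The patch size of 16 and the helper name patch_grid are assumptions for illustration and are not stated in this thread. With LSJ, images are typically cropped and padded to one fixed square, so the two grids match. With DETR-style multi-scale resizing the grid changes per image, so any size-dependent component in the backbone has to adapt at runtime. How the ViT used here does that is not confirmed in this thread.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols) of patches, padding each side up to a multiple of patch_size."""
    rows = -(-height // patch_size)  # ceiling division
    cols = -(-width // patch_size)
    return rows, cols


# Grid implied by the configured img_size=1536 (square input).
configured = patch_grid(1536, 1536)
# Grid for a 2048x1280 (width x height) test image.
runtime = patch_grid(1280, 2048)
# True when the runtime grid differs from the configured one.
grids_differ = configured != runtime
print(configured, runtime, grids_differ)
```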

Nov 17 '24 13:11 TempleX98

Oh, so even though my actual image size is (height=1024, width=1920), setting img_size=1536 doesn’t affect training a ViT with DETR augmentation, is that correct? I noticed that the img_size argument is used in the ViT backbone like this. How can img_size be ignored?

I understand that training and testing with images of size 2048x1280 produced the best results for your model. Thank you for sharing your insights! What was your original image size? Was it larger than 2048x1280 and resized, or was it originally 2048x1280?

Nov 17 '24 14:11 jih0-kim

@TempleX98 Thanks for your help. Would it be possible to ask for your advice despite your busy schedule?

Nov 25 '24 15:11 jih0-kim
