How to use gram anchoring loss in my custom dataset?

Open Extord1108 opened this issue 3 months ago • 8 comments

Hello, I have successfully run the fast setup using ViT-L/16 on my own histopathology dataset, and I am now trying to run the gram anchoring step. I notice that gram_crops_size is usually twice the global_crops_size, but I have already preprocessed all my input images to 224×224. Should I reprocess them to 448×448 as input, or just let v2.RandomResizedCrop resize them? Also, is there anything else I should note?
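
For reference, torchvision's v2.RandomResizedCrop resizes whatever region it samples to the requested output size, so it can upsample a 224×224 source to a 448×448 gram crop without reprocessing the files. A minimal sketch (the sizes and scale range below are placeholders, not the repo's exact augmentation settings):

```python
# Minimal sketch (placeholder sizes, not the repo's exact augmentation settings):
# v2.RandomResizedCrop resizes the sampled region to the requested output size,
# so a 224x224 source image is simply upsampled to a 448x448 gram crop.
import torch
from torchvision.transforms import v2

global_crops_size = 224   # hypothetical value mirroring the discussion above
gram_crops_size = 448     # roughly twice the global crop size

gram_crop = v2.RandomResizedCrop(gram_crops_size, scale=(0.32, 1.0), antialias=True)

img = torch.rand(3, global_crops_size, global_crops_size)  # stands in for a 224x224 tile
out = gram_crop(img)
print(out.shape)  # torch.Size([3, 448, 448])
```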

Extord1108 · Sep 30 '25

May I politely ask: regarding the preparation of your dataset, is it in the ImageNet format for stage 1? Thanks.

Chensihao-7 · Oct 15 '25

Due to the particularities of pathological images, we didn't fully follow the official ImageNet format, but we used it as a reference. Like that format, we use JPEG images and load them from their paths as NumPy arrays.
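
A minimal sketch of that loading pattern (the helper and paths below are hypothetical, not the actual dataset class):

```python
# Minimal sketch (hypothetical helper and paths, not the actual dataset class):
# JPEG tiles are opened with PIL and handed to the training pipeline as NumPy arrays.
import numpy as np
from PIL import Image

def load_tile_as_array(path: str) -> np.ndarray:
    """Read one JPEG tile and return an HxWx3 uint8 array."""
    with Image.open(path) as img:
        return np.asarray(img.convert("RGB"))

# Hypothetical usage: iterate over a flat list of tile paths instead of the
# ImageNet class-folder layout.
# tiles = [load_tile_as_array(p) for p in ["/data/pathology/tile_0001.jpg", ...]]
```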

Extord1108 · Oct 16 '25

Hello, I'm currently also experimenting with Gram Anchor training. I'd like to confirm whether my understanding is correct: we first train stage 1 up to a certain step, then start the Gram Anchor phase, and after enabling it, training continues from the previous checkpoint. However, I've encountered an issue: after switching to the Gram Anchor stage, my loss stops decreasing and the results don't match those reported in the paper.
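
For context, gram anchoring regularizes the patch-level similarity structure of the student against that of an earlier checkpoint. A rough conceptual sketch, not the repository's implementation, and the normalization details here are assumptions:

```python
# Conceptual sketch only (not the repository's implementation): gram anchoring
# compares the pairwise patch-similarity (Gram) matrix of the student's patch
# features with that of an earlier "gram teacher" checkpoint.
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches: torch.Tensor,
                        gram_teacher_patches: torch.Tensor) -> torch.Tensor:
    """Both inputs: (batch, num_patches, dim) patch features."""
    s = F.normalize(student_patches, dim=-1)       # L2-normalize each patch feature
    t = F.normalize(gram_teacher_patches, dim=-1)
    gram_s = s @ s.transpose(1, 2)                 # (batch, P, P) similarity matrices
    gram_t = t @ t.transpose(1, 2)
    return (gram_s - gram_t).pow(2).sum(dim=(1, 2)).mean()

# Example shapes: 2 images, 196 patches (14x14 for a 224 input at patch size 16), dim 1024
loss = gram_anchoring_loss(torch.randn(2, 196, 1024), torch.randn(2, 196, 1024))
```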

Chensihao-7 · Oct 31 '25

@Extord1108 Hello,

Could you please share your training configuration file, along with details like the global batch size and how much data you used?

I am currently working on Stage 1 pre-training in a different domain, but I haven't been able to achieve good results so far. I would really appreciate your help and insights.

youngtboy · Dec 17 '25

@youngtboy We followed almost all the configurations in the file vitl_im1k_lin834.yaml to train our ViT-L/16 model and only modified OFFICIAL_EPOCH_LENGTH, batch_size_per_gpu, and lr to fit our dataset and GPUs. We used about 5 million images and set the global batch size to 512 due to GPU memory limitations. Note that we loaded a model pretrained in our domain to initialize the parameters, and thus used a relatively small learning rate to fine-tune. Hope this helps.
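
To illustrate how those settings relate, a back-of-the-envelope sketch using the numbers quoted above (the GPU count and per-GPU batch size below are hypothetical, since they weren't stated):

```python
# Back-of-the-envelope sketch (placeholder numbers from the discussion above, not an
# exact recipe): the global batch size is batch_size_per_gpu times the number of GPUs,
# and one "official epoch" can be sized so that roughly one pass over the dataset
# corresponds to OFFICIAL_EPOCH_LENGTH optimizer steps.
num_gpus = 8                      # hypothetical
batch_size_per_gpu = 64           # hypothetical, chosen so that 8 * 64 = 512
global_batch_size = num_gpus * batch_size_per_gpu    # 512, as mentioned above

dataset_size = 5_000_000          # ~5 million images, as mentioned above
official_epoch_length = dataset_size // global_batch_size
print(global_batch_size, official_epoch_length)      # 512, 9765 iterations per "epoch"
```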

Extord1108 · Dec 17 '25

@Chensihao-7 Have you solved this problem? During the Gram Anchor stage, our model's loss was also abnormal. But since our main task is classification while the gram anchoring loss is designed for dense prediction tasks, we now only use the training result from the first stage.

Extord1108 · Dec 17 '25

@Extord1108 I understand that the prerequisite for starting the second stage is that the first stage can already produce reasonably good dense features. When you mentioned “successfully run the fast setup using ViT-L/16,” did you mean that you observed a clear and satisfactory loss-decreasing trend (to what level did the final total loss drop?), or that you could see qualitatively good dense features through visualization?

youngtboy · Dec 18 '25

Our model's loss decreased as demonstrated in the paper. The global DINO loss decreased to around 6.7 and the iBOT loss to around 3.5 at about 45k iterations.

Extord1108 · Dec 19 '25