
knn eval of MAE

Dongshengjiang opened this issue 3 years ago · 13 comments

I evaluated the vit_base checkpoint at 500/1600 pre-training epochs on imagenet1000 using a kNN metric. Loading all the pretrained parameters and using the ViT GAP method (no cls token needed), the 20-NN result on the imagenet100 dataset is 33.4, which is very low and does not match the linear probing accuracy.
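For context, the 20-NN step over frozen GAP features can be sketched as a weighted-kNN vote over cosine similarities (plain numpy; function and variable names here are illustrative, not from the repo):

```python
import numpy as np

def knn_classify(train_feats, train_labels, test_feats, k=20, num_classes=100, T=0.07):
    """Weighted kNN over L2-normalized frozen features (DINO-style voting)."""
    # L2-normalize so the dot product is cosine similarity
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test_feats @ train_feats.T                 # (n_test, n_train)
    idx = np.argsort(-sims, axis=1)[:, :k]            # indices of top-k neighbours
    preds = []
    for row, nbrs in enumerate(idx):
        votes = np.zeros(num_classes)
        for j in nbrs:
            # each neighbour votes for its label, weighted by exp(sim / T)
            votes[train_labels[j]] += np.exp(sims[row, j] / T)
        preds.append(votes.argmax())
    return np.array(preds)
```

Here `train_feats`/`test_feats` would be the GAP-pooled encoder outputs for the train and val splits.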

Dongshengjiang avatar Nov 18 '21 02:11 Dongshengjiang

Have you tried the linear probing eval?

Dongshengjiang avatar Nov 18 '21 02:11 Dongshengjiang

Emm, how about end-to-end finetuning?

pengzhiliang avatar Nov 18 '21 02:11 pengzhiliang

I just tried your latest update of end-to-end fine-tuning, and it looks good. But I still think linear probing is a metric that cannot be avoided.

Dongshengjiang avatar Nov 18 '21 02:11 Dongshengjiang

Thanks for your suggestion; we did indeed skip the linear probing metric. In fact, I am not very familiar with linear probing. Can you help me implement it? Thank you very much!

pengzhiliang avatar Nov 18 '21 03:11 pengzhiliang

https://github.com/facebookresearch/dino/blob/main/eval_linear.py — dino contains code for both the kNN and linear eval. I am not sure how to treat the cls token: linear probing only fine-tunes the last head, but for MAE the cls token is not pre-trained.
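The linear probe itself is simple: freeze the backbone, extract features once, and train only a linear classifier on top. A minimal sketch as softmax regression on frozen features (plain numpy gradient descent; names are illustrative, not taken from eval_linear.py):

```python
import numpy as np

def train_linear_probe(feats, labels, num_classes, lr=0.1, epochs=100):
    """Softmax regression on frozen features -- the backbone is never updated."""
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # cross-entropy gradient wrt logits
        W -= lr * feats.T @ grad                      # only the linear head moves
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(feats, labels, W, b):
    return float(((feats @ W + b).argmax(axis=1) == labels).mean())
```

The open question above (which feature to probe: GAP over patch tokens vs. the untrained cls token) only changes what goes into `feats`.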

Dongshengjiang avatar Nov 18 '21 03:11 Dongshengjiang

Ok, thank you~

pengzhiliang avatar Nov 18 '21 03:11 pengzhiliang

Hello, have you finished the end-to-end fine-tuning of vit-base/1600e? Can you tell me the result? Thank you!

pengzhiliang avatar Nov 19 '21 01:11 pengzhiliang

Hi, I finished the 1600-epoch pre-training, but I only got fine-tuning results of 83.15 at epoch 1400 and 82.97 at epoch 1600, which is lower than your reported epoch-400 result and the paper's results.

Dongshengjiang avatar Nov 22 '21 23:11 Dongshengjiang

From your pre-training log of vit_base, I found your max learning rate is 0.0024. Did you run with a 128 x 32 batch size? According to the code, args.lr = args.lr * total_batch_size / 256, which should give 0.0006 for a batch size of 128 x 8.

Dongshengjiang avatar Nov 23 '21 00:11 Dongshengjiang

Ok, that is very strange. I ran vit-base with 512 x 8 = 4096, so the lr is 1.5e-4 * 512 * 8 / 256 = 0.0024.
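For reference, the linear scaling rule from the code works out like this (a trivial sketch; `scaled_lr` is just a name for the formula, not a function in the repo):

```python
def scaled_lr(base_lr, total_batch_size, base_batch_size=256):
    """Linear lr scaling rule: args.lr = args.lr * total_batch_size / 256."""
    return base_lr * total_batch_size / base_batch_size

print(scaled_lr(1.5e-4, 512 * 8))   # batch 4096 -> 0.0024
print(scaled_lr(1.5e-4, 128 * 8))   # batch 1024 -> 0.0006
```

So a max lr of 0.0024 is consistent with an effective batch size of 4096, not 1024.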

pengzhiliang avatar Nov 23 '21 00:11 pengzhiliang

Ok, I will try your settings to reproduce your epoch-400 results. But the epoch-1600 results were already with batch size 4096, and are still not good enough. The fine-tuning accuracy increases slowly with epoch:

| epoch | 200 | 400 | 600 | 800 | 1000 | 1200 | 1400 | 1600 |
|---|---|---|---|---|---|---|---|---|
| ft acc | 82.71 | 82.82 | 82.87 | 83.00 | 82.78 | 82.96 | 83.15 | 82.97 |

Dongshengjiang avatar Nov 23 '21 02:11 Dongshengjiang

OK, thank you for running so many experiments! Maybe there is still some problem; I will check it carefully.

pengzhiliang avatar Nov 23 '21 08:11 pengzhiliang

@Dongshengjiang Have you tried the LinearProbe evaluation with cls token?

The paper said: As ViT has a class token [16], to adapt to this design, in our MAE pre-training we append an auxiliary dummy token to the encoder input. This token will be treated as the class token for training the classifier in linear probing and fine-tuning.

It seems that the author just adds a dummy token during pre-training, and directly uses its output as the feature for linear probing.
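So the pre-training change would just be prepending a learnable dummy token to the patch sequence; the probe then reads that token's output. A toy numpy sketch of the shape handling (illustrative only, not the repo's implementation):

```python
import numpy as np

def prepend_cls_token(patch_tokens, cls_token):
    """Prepend a (learnable) dummy cls token to every sequence in the batch.

    patch_tokens: (batch, num_patches, dim); cls_token: (dim,).
    Returns (batch, num_patches + 1, dim); the output at position 0 is
    what linear probing / fine-tuning would later use as the feature.
    """
    batch = patch_tokens.shape[0]
    cls = np.broadcast_to(cls_token, (batch, 1, cls_token.shape[0]))
    return np.concatenate([cls, patch_tokens], axis=1)
```

In the real model `cls_token` would be an `nn.Parameter` updated during pre-training, even though no reconstruction loss is attached to it directly.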

Harick1 avatar Dec 12 '21 14:12 Harick1