Implementation details in few-shot ImageNet evaluation
Thanks for the amazing paper! I have a few questions about the details of the results in Figure 6 of the paper. I tried to adapt the linear-probing code on CIFAR from the README.md to run a one-shot evaluation on ImageNet. However, my evaluation with ViT-B/32 only gets 27% accuracy on the ImageNet validation set (not ImageNet V2) instead of about 45% as reported in the paper. I therefore suspect that I missed some details of the paper's experiments, and I have a few questions:
- What CLIP model is Figure 6 using? The zero-shot accuracy makes me think it is ViT-B/32. However, in my own evaluation, ViT-B/32 does not reach 45% one-shot accuracy.
- Could you disclose the C parameter used in sklearn for this experiment? Although on my end the performance does not vary much across different C values, I would appreciate it if this value could be provided.
- How are the instances per class selected? I select the first sample of each class in the ImageNet training set, which yields a dataset of 1000 images, each with a 768-dimensional feature (taken before the final projection to 512 dimensions). Does my selection approach sound reasonable? A sketch of my setup is included below.
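For reference, here is a minimal sketch of my one-shot setup, adapted from the README's linear-probe example. The ImageNet path is a placeholder, C=0.316 is simply copied from the README's CIFAR100 example (not necessarily what Figure 6 used), and for simplicity this version uses the 512-dimensional encode_image output rather than the 768-dimensional pre-projection features:

```python
import clip
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import ImageNet

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device)

def extract_features(dataset):
    # Encode images with the CLIP image encoder and collect labels.
    features, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=256, num_workers=8):
            features.append(model.encode_image(images.to(device)).float().cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(features), np.concatenate(labels)

# Assumes a standard torchvision ImageNet layout under this placeholder path.
train = ImageNet(root="/path/to/imagenet", split="train", transform=preprocess)
val = ImageNet(root="/path/to/imagenet", split="val", transform=preprocess)

# One-shot selection: first occurrence of each of the 1000 classes in the train split.
targets = np.array(train.targets)
one_shot_idx = [int(np.where(targets == c)[0][0]) for c in range(1000)]

train_features, train_labels = extract_features(Subset(train, one_shot_idx))
val_features, val_labels = extract_features(val)

# C=0.316 is taken from the README's linear-probe example; the paper does not
# state the value used for Figure 6.
classifier = LogisticRegression(C=0.316, max_iter=1000)
classifier.fit(train_features, train_labels)

accuracy = (classifier.predict(val_features) == val_labels).mean() * 100
print(f"One-shot ImageNet accuracy: {accuracy:.2f}%")
```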
Thanks again for the amazing paper and thanks in advance for helping me.
I tried running the zero-shot prediction test on CIFAR100 following the README.md (Zero-Shot Prediction) and also can't reach the performance reported in the paper. Did you solve this problem?
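For context, here is a rough sketch of what I ran: the README's zero-shot example extended to the full CIFAR100 test set, using a single "a photo of a {label}" prompt per class with the torchvision class names (the batch size and the single-template prompt are my own choices, with no prompt ensembling):

```python
import os
import clip
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device)

test = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False, transform=preprocess)

# Build one text embedding per class from a single prompt template.
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in test.classes]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    text_features /= text_features.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, labels in DataLoader(test, batch_size=256, num_workers=8):
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"Zero-shot CIFAR100 accuracy: {100 * correct / total:.2f}%")
```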
Please refer to https://github.com/openai/CLIP/blob/fcab8b6eb92af684e7ff0a904464be7b99b49b88/notebooks/Prompt_Engineering_for_ImageNet.ipynb for this concern.
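In short, that notebook builds the zero-shot classifier by ensembling many prompt templates per class rather than a single prompt. A condensed sketch of the idea follows; the four templates here are an illustrative subset of the notebook's full template list, not a replacement for it:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device)

# Illustrative subset of prompt templates; the notebook uses a much larger set.
templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the large {}.",
    "a photo of the small {}.",
]

def build_zeroshot_classifier(classnames):
    # For each class, encode every template, average the normalized embeddings,
    # and re-normalize the mean to get one classifier weight vector per class.
    weights = []
    with torch.no_grad():
        for name in classnames:
            texts = clip.tokenize([t.format(name) for t in templates]).to(device)
            embeddings = model.encode_text(texts)
            embeddings /= embeddings.norm(dim=-1, keepdim=True)
            mean_embedding = embeddings.mean(dim=0)
            weights.append(mean_embedding / mean_embedding.norm())
    return torch.stack(weights, dim=1)  # shape: (embed_dim, num_classes)

# Usage: logits = 100.0 * normalized_image_features @ build_zeroshot_classifier(classnames)
```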