
[Question] Performance gap on zero-shot prediction with ILSVRC2012

kuri-leo opened this issue on May 20, 2022 · 0 comments

Hi there,

Thanks for your contribution and for open-sourcing this research; its zero-shot prediction capability is really amazing.

I tested the zero-shot prediction performance on ILSVRC2012 following the example code, and I got a top-1 accuracy of 73.75% on the train split and 72.36% on the val split, which is lower than the 76.2% reported in the paper. Similar gaps on other datasets were reported in https://github.com/openai/CLIP/issues/164 , https://github.com/openai/CLIP/issues/165 , https://github.com/openai/CLIP/issues/166 and https://github.com/openai/CLIP/issues/167
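For concreteness, here is roughly the evaluation I ran: a minimal sketch following the README's zero-shot example, where the checkpoint name, dataset path, batch size, and the use of torchvision's ImageNet class names are my local choices, not something prescribed by the repo.

```python
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageNet

device = "cuda" if torch.cuda.is_available() else "cpu"
# Checkpoint and dataset path are placeholders for my local setup.
model, preprocess = clip.load("ViT-L/14", device=device)

# torchvision exposes the official (long) synset names: dataset.classes[i]
# is a tuple such as ('great white shark', 'white shark', 'man-eater', ...);
# here I take the first name of each tuple.
dataset = ImageNet(root="/path/to/imagenet", split="val", transform=preprocess)
class_names = [names[0] for names in dataset.classes]

# Build one text prompt per class and encode the prompts once.
prompts = [f"a photo of a {name}." for name in class_names]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

loader = DataLoader(dataset, batch_size=64, num_workers=4)
correct = total = 0
for images, labels in loader:
    with torch.no_grad():
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
    # Predict the class whose text embedding has the highest cosine similarity.
    pred = (image_features @ text_features.T).argmax(dim=-1).cpu()
    correct += (pred == labels).sum().item()
    total += labels.numel()

print(f"top-1 accuracy: {correct / total:.2%}")
```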

I would also like to confirm whether this gap comes from how the text prompts are generated.

As described in the paper:

For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP.

Here the text prompt is generated from the prompt template A photo of a {label}. While ImageNet officially provides long label names, e.g. great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias, several simplified name lists have been created. I've tested the names from Google and from anishathalye, which give top-1 accuracies of 71.82%@val and 73.14%@val respectively (the latter is higher than my result with the official names).
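Swapping in an alternative name list looks roughly like this; I'm assuming the file layout of anishathalye's imagenet-simple-labels repository here, so the URL may need adjusting.

```python
import json
import urllib.request

# Assumed location of the simplified label list in
# anishathalye/imagenet-simple-labels (1000 strings, index-aligned
# with the ImageNet class order).
URL = ("https://raw.githubusercontent.com/anishathalye/"
       "imagenet-simple-labels/master/imagenet-simple-labels.json")
with urllib.request.urlopen(URL) as f:
    simple_names = json.load(f)

# Rebuild the prompts, re-encode the text features, and rerun the
# evaluation loop from the sketch above.
prompts = [f"a photo of a {name}." for name in simple_names]
```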

I would appreciate it if you or the community could offer any hints on this :-)

Thanks in advance and have a nice weekend. Leo
