The working mechanism of the classifier
Dear author, thank you very much for your excellent work. I have a question: does the classifier compute the cosine similarity between image and text embeddings in the same way as CLIP, or is it designed differently? I couldn't find detailed information on this part.
Hi there, thank you for your interest in our work. Yes, the classifier works in the same way as CLIP, i.e., the classifier weights are essentially composed of text embeddings.
@machuofan When training, is the input on the text side the image's caption, or just a template like "a photo of ..."?
It's 'a xxx', i.e., just an article followed by the class name, not the image's caption.
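For readers landing here: the mechanism described above (classifier weights = text embeddings, logits = cosine similarity) can be sketched roughly as follows. This is a minimal illustration with random placeholder embeddings, not the authors' actual code; the function name and dimensions are made up for the example.

```python
import numpy as np

def cosine_classify(image_emb, text_embs):
    """CLIP-style classification: the classifier 'weights' are the
    L2-normalized text embeddings (one per class, from prompts like
    'a cat', 'a dog'); the logits are cosine similarities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img  # one cosine similarity per class
    return int(np.argmax(logits))

# Placeholder embeddings for three classes (in practice these come
# from the text encoder applied to the 'a xxx' prompts).
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 512))
# A hypothetical image embedding close to class 1.
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)
print(cosine_classify(image_emb, text_embs))  # → 1
```

Note that normalizing both sides makes the dot product equal to cosine similarity, which is why the text embeddings can simply be stacked and used as a linear classifier head.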