
Get well-adjusted confidence scores from the similarity of CLIP encodings

justlike-prog opened this issue 1 year ago • 8 comments

I am using CLIP to check the similarity between text and an image. I have a list of words (objects) I want to check against, for example (“elephant”, “tiger”, “giraffe”).

By taking the dot product of the encodings I get the similarity value. To evaluate the “confidence” I take the softmax over the outputs, and it works very well at predicting which class is in the image. But the classes might not be mutually exclusive, and in that case softmax doesn’t make sense. I tried sigmoid, as it is used for multi-label classification, but it seems to give me values all around 0.55 (classes that are correct around 0.56 and classes that are wrong around 0.54), so in the example something like (0.565, 0.55, 0.62) if an elephant and a giraffe are in the picture. That makes it hard to set a threshold.

I would like to get something like (0.95, 0.05, 0.98) if an elephant and a giraffe are in the picture, i.e. a high score for both words.

Am I overcomplicating this, and is there a standard way to do it? Is it even possible to get such a well-adjusted confidence score?
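
For reference, this is roughly what I am doing (the model tag and image path below are just placeholders):

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

words = ["elephant", "tiger", "giraffe"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer([f"a photo of a {w}" for w in words])

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)  # L2-normalize both sides,
    txt = txt / txt.norm(dim=-1, keepdim=True)  # so the dot product is a cosine

    sims = img @ txt.T  # shape (1, 3)
    # works well, but only if the classes are mutually exclusive:
    probs = (model.logit_scale.exp() * sims).softmax(dim=-1)
    # sigmoid of small cosine values: everything lands near 0.55
    multi = sims.sigmoid()
```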

justlike-prog avatar Dec 23 '22 08:12 justlike-prog

What if you apply a threshold to the dot products directly (i.e. the logits)? Or use a temperature in your softmax to smooth the predictions? Or add combinations such as "elephant and giraffe" to your prompt list?
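
A quick sketch of the first two ideas (the similarity values here are made up):

```python
import torch

sims = torch.tensor([[0.28, 0.18, 0.27]])  # cosines for elephant, tiger, giraffe

# Option 1: threshold the raw similarities (the cutoff has to be tuned)
detected = sims > 0.25

# Option 2: softmax with a temperature T; a larger T flattens the
# distribution, so two high logits keep comparable probabilities
T = 0.05
probs = (sims / T).softmax(dim=-1)
```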

SimJeg avatar Dec 23 '22 09:12 SimJeg

I tried with the logits, but it is hard to set a threshold there, since they are not normalised (or at least I think that is the issue): something matching the class might have a similarity score of 22, but something that doesn't represent the class might just as well score 21, or 5. So a threshold is hard to set.

I also checked combinations before, but the predictions seem to be less confident.

I can't use softmax because it is a multi-label task. Or maybe you mean something else there.

justlike-prog avatar Dec 23 '22 09:12 justlike-prog

Did you apply L2 normalization before the dot products? This should be done, as a cosine loss was used for training (so the similarities are in fact normalized).

I proposed adding a temperature to the softmax so that if you have two high logits l1 > l2, you don't end up with p1 >> p2 after the softmax.

SimJeg avatar Dec 23 '22 10:12 SimJeg

Yes, normalisation is applied. Again, softmax doesn't work well here; with more non-mutually-exclusive classes it will be hard to adjust.

justlike-prog avatar Dec 23 '22 10:12 justlike-prog

Maybe, if you have a dataset, calibrate the logits (class-wise) with a logistic regression?
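
Something like this per class (a sketch; the score and label arrays are assumed to be precomputed on a labelled set, and the file names are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.load("val_scores.npy")  # (n_images, n_classes) CLIP similarities
labels = np.load("val_labels.npy")  # (n_images, n_classes) binary: class present or not

# One 1-D logistic regression per class == Platt scaling of that class's logit
calibrators = [LogisticRegression().fit(scores[:, [c]], labels[:, c])
               for c in range(scores.shape[1])]

# Calibrated probability that class 0 (e.g. elephant) is present in new images
test = np.load("test_scores.npy")
p_elephant = calibrators[0].predict_proba(test[:, [0]])[:, 1]
```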

SimJeg avatar Dec 23 '22 10:12 SimJeg

Yeah, I was thinking about that, thanks! I will give it a try. Although I didn't find examples online for calibration in a multi-label setting.

justlike-prog avatar Dec 23 '22 10:12 justlike-prog

I am not sure what you mean by multi-label calibration. If all your classes are well calibrated, a single threshold will "work" for all the classes, by definition of calibration?

SimJeg avatar Dec 23 '22 11:12 SimJeg

Also, if you have a big enough dataset, I would recommend training the logistic regression on the layer before the projection (i.e. replace proj @ text_encoder(tiger) with the weights of a new logistic regression).
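
Roughly like this (a sketch; how you pull out the pre-projection features depends on the model, here they are assumed to be dumped to disk under placeholder names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

feats = np.load("val_features.npy")  # (n_images, d) image features
labels = np.load("val_labels.npy")   # (n_images, n_classes) binary targets

# One linear probe per class, replacing the proj @ text_encoder(word) weights
probes = [LogisticRegression(max_iter=1000).fit(feats, labels[:, c])
          for c in range(labels.shape[1])]
```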

SimJeg avatar Dec 23 '22 11:12 SimJeg

Ok I will check it out, thanks!

justlike-prog avatar Dec 28 '22 05:12 justlike-prog

@SimJeg Do you maybe know a guide somewhere about doing such a calibration? Never did anything like that. Thanks!

justlike-prog avatar Jan 03 '23 11:01 justlike-prog

@justlike-prog You can check sklearn's tutorial https://scikit-learn.org/stable/modules/calibration.html

mehdidc avatar Jan 03 '23 12:01 mehdidc

@mehdidc hmm, this one seems to work with sklearn models rather than torch models, for example?

justlike-prog avatar Jan 03 '23 12:01 justlike-prog

@justlike-prog I would also advise sklearn: extract the scores with your torch model, then calibrate them (e.g. Platt scaling or isotonic regression). If you use logistic regression, you can later append the weights as a linear layer to your torch model.
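
For the last point, something like this (a sketch reusing the placeholder arrays from the earlier snippet, for a single class):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

scores = np.load("val_scores.npy")
labels = np.load("val_labels.npy")

# Isotonic regression: a non-parametric alternative to Platt scaling
iso = IsotonicRegression(out_of_bounds="clip").fit(scores[:, 0], labels[:, 0])

# With logistic regression (Platt scaling), the calibrator can be folded into
# the torch model as a 1-in / 1-out linear layer per class:
lr = LogisticRegression().fit(scores[:, [0]], labels[:, 0])
layer = nn.Linear(1, 1)
with torch.no_grad():
    layer.weight.copy_(torch.tensor(lr.coef_, dtype=torch.float32))
    layer.bias.copy_(torch.tensor(lr.intercept_, dtype=torch.float32))
# torch.sigmoid(layer(score)) now matches lr.predict_proba(score)[:, 1]
```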

SimJeg avatar Jan 03 '23 13:01 SimJeg

Alright, will do that. Thanks a ton! Would I get a probability for each class independently? For example, I would calibrate the scores for the class elephant by using the logits as data, with labels 0 or 1 depending on whether there is an elephant in the image or not? And do that for each class independently?

justlike-prog avatar Jan 04 '23 07:01 justlike-prog

Yes, exactly! The quality of the calibration will depend on the number of samples, as usual.

SimJeg avatar Jan 04 '23 09:01 SimJeg