
Is there a way to do multi-label classification with CLIP?

Open justlike-prog opened this issue 2 years ago • 7 comments

The concrete use case is as follows. I have the classes baby, child, teen, adult. My idea was to use similarity between text and image features (for the text features I used the prompt 'there is at least one (c) in the photo', where c is one of the 4 classes).

I went through quite a lot of examples, but I am running into the issue that the similarity scores are often very different for a fixed class, and/or classes that appear together get very similar scores (like baby and child). For similarity scores I use the cosine similarity multiplied by 2.5 to stretch the score into the interval [0, 1], as is done in the CLIPScore paper.

Setting a per-class threshold therefore doesn't seem feasible.

Does anyone have an idea for this? I feel quite stuck on how to proceed.
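For reference, the scoring described above can be sketched as follows (a minimal numpy sketch, assuming `img_feat` and `txt_feat` are CLIP embeddings for one image and one caption; the factor 2.5 is the weight w from the CLIPScore paper):

```python
import numpy as np

def clip_score(img_feat: np.ndarray, txt_feat: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore-style similarity: w * max(cos(img, txt), 0)."""
    img = img_feat / np.linalg.norm(img_feat)
    txt = txt_feat / np.linalg.norm(txt_feat)
    # Cosine similarity of the normalized embeddings, clipped at 0
    # and rescaled; in practice CLIP image-text cosines tend to be
    # well below 1, which is why the paper rescales by 2.5.
    return w * max(float(img @ txt), 0.0)
```

Note that nothing in this formula forces the result into [0, 1] for arbitrary vectors; the rescaling only works because of the empirical range of CLIP similarities.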

justlike-prog avatar Jan 02 '23 10:01 justlike-prog

Not sure if it would work, but have you by any chance looked at using captions like "this is a photo of a " + ", ".join(subset), where subset iterates over all subsets of your current classes? You'd then have 2^4 classes instead of 4.
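The subset-caption idea above can be sketched like this (the catch-all caption for the empty subset is my own addition; the exact wording of the prompts is just an illustration):

```python
from itertools import combinations

classes = ["baby", "child", "teen", "adult"]

# One caption per non-empty subset of the classes, plus a catch-all
# for the empty subset, giving 2^4 = 16 candidate captions in total.
captions = ["a photo with no people"]
for r in range(1, len(classes) + 1):
    for subset in combinations(classes, r):
        captions.append("this is a photo of a " + ", ".join(subset))
```

Zero-shot classification over these 16 captions then directly yields a subset of labels instead of a single label, at the cost of the caption set growing exponentially in the number of classes.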

mitchellnw avatar Jan 02 '23 17:01 mitchellnw

I am attempting this now (training on captions with multiple labels and then querying with single labels), and it works pretty badly compared to any normal multi-label classifier.

{'f1': 0.08291136675917679, 'precision': 0.07481833065257353, 'recall': 0.10588978264912757}

If I figure this out I will let you know.

Take a look at this paper: "DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations"

I struggled with this problem for a while and this approach is working for me.

Msalehi237 avatar Jun 23 '23 05:06 Msalehi237

@AmericanPresidentJimmyCarter did you find a way to improve the multi-labelling performance?

travellingsasa avatar Mar 28 '24 19:03 travellingsasa

No, I just trained multilabel classifiers instead and those worked.

@travellingsasa

You can do some sort of anti-text or placeholder text to do multi-label classification, ex:

if your objective is checking whether "red" is present in an image of a dress, then use:

["a red dress", "a dress"]

That will give you a probability distribution, and you take the zero index.
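A minimal numpy sketch of this trick (the logit values are made up for illustration; in practice they would come from CLIP's scaled image-text similarities for the two prompts):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical similarity logits of one image against the prompt pair
# ["a red dress", "a dress"], e.g. 100 * cosine similarity as produced
# by CLIP's logit scale.
logits = np.array([28.0, 21.0])

probs = softmax(logits)
p_red = probs[0]  # zero index = probability mass on "a red dress"
```

Repeating this with one prompt pair per attribute gives an independent score per label, which is what makes it usable for multi-label classification.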

miguelalba96 avatar Apr 19 '24 12:04 miguelalba96

> @travellingsasa
>
> You can do some sort of anti-text or placeholder text to do multi-label classification, ex:
>
> if your objective is checking whether "red" is present in an image of a dress, then use:
>
> ["a red dress", "a dress"]
>
> that will give you a probability distribution and you take the zero index

How does that work? If the image contains neither, the result will be essentially random. I think it only works if you have a multi-label classifier to identify a dress in the first place.