
CLIP's capability of detecting scene or background information

Open Seeker98 opened this issue 1 year ago • 1 comments

How well does CLIP perform at detecting global image information? For example, can it tell whether an image is corrupted by noise, downsampled, or hazy, and can it go further and pick the right corruption parameter, such as the noise std? I tried images with different types of noise (Gaussian, Poisson, gamma) and other corruptions such as downsampling and haze, with prompt sets like [gaussian noise with std=25, gaussian noise with std=50] and [noisy, hazy], but the inference results were poor. Am I missing a key part of the testing procedure?
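For reference, here is a minimal sketch of how test images of this kind could be generated. It is not the exact pipeline used in the experiments above. The helper names `add_gaussian_noise` and `downsample` are hypothetical, the image is a toy grayscale grid of 0-255 values, and block averaging is just one assumed way to downsample.

```python
import random

def add_gaussian_noise(image, std, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image (rows of 0-255 values)."""
    rng = random.Random(seed)  # fixed seed so the corruption is reproducible
    # Clamp each noisy pixel back into the valid 0-255 range.
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, std))) for px in row]
            for row in image]

def downsample(image, factor):
    """Downsample by averaging non-overlapping factor x factor blocks."""
    height = len(image) // factor
    width = len(image[0]) // factor
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            block = [image[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# Toy 8x8 flat gray image standing in for a real photo (an assumption).
clean = [[128.0] * 8 for _ in range(8)]
noisy_25 = add_gaussian_noise(clean, std=25)  # matches the "std=25" prompt
noisy_50 = add_gaussian_noise(clean, std=50)  # matches the "std=50" prompt
small = downsample(clean, factor=2)           # 8x8 -> 4x4
```

Each corrupted variant would then be scored against the corresponding prompt set.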

Seeker98 avatar Sep 21 '23 14:09 Seeker98

CLIP is not for text generation. It needs text from the user as input to create text embeddings, which are matched against the image embedding to give a cosine similarity score. CLIP is not good at fine-grained classification (as mentioned in the paper). CLIP is also trained on internet data, which may focus more on everyday objects. For example, if an image of a car on the internet is noisy or hazy, there is a high chance that its text description still mentions 'car' and says nothing about 'noise'.
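To make the matching step concrete, here is a rough sketch of how candidate prompts are scored against an image. The embeddings are made-up 3-dimensional placeholders, not real CLIP outputs; in practice they would come from CLIP's image and text encoders. The function name `zero_shot_probs` is hypothetical, and the `logit_scale` of 100 is an assumed value close to what released CLIP models use.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one probability per prompt."""
    logits = [logit_scale * cosine_similarity(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Placeholder embeddings (assumptions for illustration only).
prompts = ["a photo of a car", "a noisy photo", "a hazy photo"]
image_emb = [0.9, 0.1, 0.2]
text_embs = [
    [0.8, 0.2, 0.1],  # close to the image embedding
    [0.1, 0.9, 0.3],
    [0.2, 0.3, 0.9],
]

probs = zero_shot_probs(image_emb, text_embs)
best = prompts[probs.index(max(probs))]
```

With these toy numbers the object prompt wins by construction, which mirrors the point above: if the image embedding mostly encodes the object, corruption prompts get little probability mass.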

mgupta70 avatar Jan 31 '24 08:01 mgupta70