CLIP
CLIP's capability of detecting scene or background information
How does CLIP perform on detecting global image information? For example, determining whether an image is noise-corrupted, downsampled, or hazy, and furthermore, identifying the right corruption parameters, such as the noise standard deviation? I tried images with different types of noise (Gaussian, Poisson, gamma) and other corruptions (downsampling, haze), together with prompt sets like ["gaussian noise with std=25", "gaussian noise with std=50"] or ["noisy", "hazy"], but the inference results are poor. Am I missing any key part in my testing setup?
CLIP is not a text-generation model. CLIP needs text from the user as input to create its text embeddings, which are matched against the image embedding to produce a cosine similarity score.

CLIP is also not good at fine-grained classification (as mentioned in the paper), and it is trained on internet data, which tends to focus on everyday objects. For example, if an image of a car on the internet is noisy or hazy, there is a high chance the accompanying text still says 'car' and mentions nothing about the noise.
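To make the matching step concrete: CLIP scores each candidate prompt against the image by cosine similarity between normalized embeddings, and those similarities are typically turned into probabilities with a softmax. Here is a minimal NumPy sketch of just that scoring step. The embeddings below are random placeholders, not real CLIP outputs; in practice you would obtain them from a CLIP implementation (e.g. `CLIPModel.get_image_features` / `get_text_features` in the `transformers` library), and the `temperature` value is an assumption standing in for CLIP's learned logit scale.

```python
import numpy as np

def clip_scores(image_emb, text_embs, temperature=100.0):
    """Cosine similarity between one image embedding and several text
    embeddings, converted to probabilities with a numerically stable
    softmax. `temperature` is a placeholder for CLIP's learned logit scale."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * text_embs @ image_emb   # cosine similarities, scaled
    exp = np.exp(logits - logits.max())            # subtract max for stability
    return exp / exp.sum()

# Placeholder embeddings; real ones would come from a CLIP model.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal(512)
prompts = ["a noisy photo", "a hazy photo", "a clean photo"]
text_embs = rng.standard_normal((len(prompts), 512))

probs = clip_scores(image_emb, text_embs)
for prompt, prob in zip(prompts, probs):
    print(f"{prob:.3f}  {prompt}")
```

Note that nothing in this pipeline forces the scores to track corruption level: if the text encoder has rarely seen prompts like "gaussian noise with std=25" paired with matching images during training, the embeddings for those prompts will not separate the images the way you intend.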