recognize-anything
How do you get the thresholds for the CLIP model in the results of Table 3?
Hi, I want to know how you obtained the thresholds for the CLIP model in the results of Table 3. Is it the same way you described in another issue?
Similar to the zero-shot inference of CLIP on ImageNet, we directly use "cross-modal feature similarity + threshold" for image tagging evaluation. It is worth noting that this approach is very sensitive to the choice of threshold and is difficult to apply in practice.
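For reference, here is a minimal sketch of this "cross-modal feature similarity + threshold" tagging approach using the openai/CLIP package. The tag list, prompt template, checkpoint, and threshold value below are illustrative assumptions, not the exact settings used for the paper's results.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Illustrative candidate tags; the paper evaluates on a dataset's full label set.
tags = ["dog", "cat", "person", "car", "tree"]
text = clip.tokenize([f"a photo of a {t}" for t in tags]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # Cosine similarity between the image and each tag prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)  # shape: (num_tags,)

threshold = 0.25  # illustrative value; results are very sensitive to this choice
predicted = [t for t, s in zip(tags, sims.tolist()) if s > threshold]
print(predicted)
```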
But I see that the results of CLIP on the multi-label classification datasets in Table 3 are competitive. Could you explain in detail how you determined the thresholds?
We just manually adjusted the threshold to achieve the best performance for CLIP. For a fair comparison, each model in Table 3 of the RAM paper uses a single unified threshold shared across all categories.
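A minimal sketch of what such a sweep over a single shared threshold could look like, assuming precomputed similarity scores and binary ground-truth labels; the metric (micro-F1) and the search grid here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_unified_threshold(scores, labels, grid=np.linspace(0.1, 0.4, 31)):
    """Pick one threshold shared by all categories that maximizes micro-F1.

    scores: (num_images, num_tags) array of image-text similarities
    labels: (num_images, num_tags) binary ground-truth tag matrix
    """
    best_t, best_f1 = None, -1.0
    for t in grid:
        preds = (scores > t).astype(int)
        # Flattening image/tag pairs gives micro-averaged F1.
        f1 = f1_score(labels.ravel(), preds.ravel())
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```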
Got it. Thank you very much!