CLIP
Fine-tuning CLIP for image object detection.
Thank you for publishing such great research and the implementation! This is more of a question than an issue per se.
I have used the image encoder of CLIP to encode video frames for video action classification. I used it as a frozen backbone and it outperformed all the other backbones I tried (ViT being the second best). Now, I would like to use it again as a backbone, this time for image object detection. However, I would like to fine-tune it while training the rest of the model. The dataset I have contains around 200K images.

I would like to ask whether you think fine-tuning is feasible in this case, since the task I will be fine-tuning on is quite different from the original task it was trained on. I am supposing that a lower learning rate (than the one used for the rest of the model), a lower weight decay value, and a late start of fine-tuning would be necessary here. Any other advice for fine-tuning in this case would be more than welcome :)
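For concreteness, here is a minimal sketch of the setup I have in mind (frozen backbone, separate parameter groups with a lower learning rate and weight decay, and a delayed unfreeze). The `backbone` and `head` modules below are just hypothetical stand-ins; in practice the backbone would be CLIP's image encoder (e.g. the `visual` submodule returned by `clip.load("ViT-B/32")`) and the head would be the detection model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for illustration only: in the real setup,
# `backbone` would be CLIP's image encoder and `head` the detector.
backbone = nn.Linear(512, 256)
head = nn.Linear(256, 10)

# Phase 1: freeze the backbone while the rest of the model trains.
for p in backbone.parameters():
    p.requires_grad = False

# Separate parameter groups: a much lower learning rate and weight
# decay for the backbone than for the rest of the model. The exact
# values here are placeholders, not recommendations.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-6, "weight_decay": 1e-4},
    {"params": head.parameters(),     "lr": 1e-4, "weight_decay": 1e-2},
])

def unfreeze_backbone():
    # Phase 2: start fine-tuning the backbone once the head has
    # warmed up (the "late start" mentioned above).
    for p in backbone.parameters():
        p.requires_grad = True
```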
Could you share how you implemented the CLIP image encoder as a backbone?