CLIP
Fine-tuning CLIP for image object detection.
Thank you for publishing such great research and the implementation! This is more of a question than an issue per se.
I have used the image encoder of CLIP to encode video frames for video action classification. I used it as a frozen backbone and it outperformed all the other backbones I tried (ViT being the second best). Now, I would like to use it again as a backbone, this time for image object detection. However, I would like to fine-tune it while training the rest of the model. The dataset I have contains around 200K images.

I would like to ask whether you think fine-tuning is feasible in this case, since the task I will be fine-tuning on is quite different from the original task it was trained on. I am supposing that a lower learning rate (than the one used for the rest of the model), a lower weight decay value, and a late start of fine-tuning would be necessary here. Any other advice for fine-tuning in this case would be more than welcome :)
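For concreteness, here is a minimal sketch of the setup I have in mind (frozen backbone, separate parameter groups with a lower learning rate and weight decay, and a delayed unfreeze). The `backbone` and `head` modules below are just hypothetical stand-ins; in practice the backbone would be CLIP's image encoder (e.g. the `visual` submodule returned by `clip.load("ViT-B/32")`) and the head would be the detection model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for illustration only: in the real setup,
# `backbone` would be CLIP's image encoder and `head` the detector.
backbone = nn.Linear(512, 256)
head = nn.Linear(256, 10)

# Phase 1: freeze the backbone while the rest of the model trains.
for p in backbone.parameters():
    p.requires_grad = False

# Separate parameter groups: a much lower learning rate and weight
# decay for the backbone than for the rest of the model. The exact
# values here are placeholders, not recommendations.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-6, "weight_decay": 1e-4},
    {"params": head.parameters(),     "lr": 1e-4, "weight_decay": 1e-2},
])

def unfreeze_backbone():
    # Phase 2: start fine-tuning the backbone once the head has
    # warmed up (the "late start" mentioned above).
    for p in backbone.parameters():
        p.requires_grad = True
```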
Could you share how you implemented the CLIP image encoder as a backbone?