
Conceptual Captions Training

Open goel-shashank opened this issue 3 years ago • 9 comments

I have trained the model (both the MLP and GPT-2 variants) on the CC3M dataset, but the loss doesn't seem to decrease much (it stays around 3.0). What loss can I expect for a good model? How many epochs should I run it for? Also, is any specific hyperparameter tuning required for CC? I have a model trained for 5 epochs, but it generates a similar caption for every image. I tried overfitting on a batch of 512 image-caption pairs and everything works out, so I don't think there is any logical issue with the pipeline. Please let me know.

goel-shashank • Feb 01 '22 02:02
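For reference, the single-batch sanity check described above can be reproduced with a small self-contained sketch. This is a simplified stand-in for the prefix-captioning setup (an MLP that maps a CLIP-sized embedding to GPT-2 prefix embeddings), not the repository's exact code; the names, dimensions, captions, and the random "CLIP" prefixes are all illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
prefix_length, clip_dim = 10, 512
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
gpt = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
embed_dim = gpt.transformer.wte.weight.shape[1]      # 768 for the base gpt2 model
# Toy mapping network: one CLIP embedding -> prefix_length GPT-2 embeddings.
mapper = nn.Sequential(
    nn.Linear(clip_dim, embed_dim * prefix_length // 2),
    nn.Tanh(),
    nn.Linear(embed_dim * prefix_length // 2, embed_dim * prefix_length),
).to(device)
# One fixed batch: random "CLIP" prefixes paired with a few tokenized captions.
captions = ["a dog runs on the beach", "a red car parked on the street"] * 4
tokens = tokenizer(captions, return_tensors="pt", padding=True).input_ids.to(device)
prefix = torch.randn(len(captions), clip_dim, device=device)
optimizer = torch.optim.AdamW(list(mapper.parameters()) + list(gpt.parameters()), lr=2e-5)
for step in range(300):
    prefix_embed = mapper(prefix).view(-1, prefix_length, embed_dim)
    token_embed = gpt.transformer.wte(tokens)
    inputs_embeds = torch.cat([prefix_embed, token_embed], dim=1)
    logits = gpt(inputs_embeds=inputs_embeds).logits[:, prefix_length - 1:-1]
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), tokens.flatten())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 50 == 0:
        print(step, round(loss.item(), 3))  # should drop toward ~0 on this fixed batch

If the loss does not collapse on a single repeated batch, the problem is in the data or the training loop rather than in model capacity or hyperparameters.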

Hi @goel-shashank, Are you using our default parameters?

Did you try both the GPT-2 fine-tuning and the frozen GPT-2?

rmokady • Feb 01 '22 20:02
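For context, the frozen-GPT-2 variant trains only the mapping network while the language model's weights stay fixed (this is what the repository's --only_prefix flag selects). A minimal sketch of the difference, reusing the hypothetical mapper and gpt modules from the sketch above:

for p in gpt.parameters():
    p.requires_grad = False      # GPT-2 stays fixed; gradients flow only into the mapper
gpt.eval()                       # keep dropout off in the frozen language model
optimizer = torch.optim.AdamW(mapper.parameters(), lr=2e-5)  # optimize the mapper only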

Hi @rmokady, I tried the default parameters. Do you have the training logs for your run? One thing I'm certainly doing differently is that I have trained a separate CLIP model (RN50 with 20% ImageNet zero-shot accuracy) on CC3M, rather than using OpenAI's pretrained weights. The prefixes are generated from this model. I don't think this should be causing these issues.

goel-shashank • Feb 01 '22 21:02

For COCO, where we train both the prefix and GPT-2, the loss got down to 1.47. Unfortunately, the logs for Conceptual Captions were left on an old server and I cannot access them anymore. Also, 5 epochs over 3M images is a lot when using the standard CLIP.

Anyway, outputting the same sentence for any prefix usually means there is a bug somewhere

rmokady • Feb 03 '22 21:02
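One concrete check along these lines is to confirm that the CLIP prefixes actually differ between images before suspecting the mapping network or GPT-2. A rough sketch, assuming prefixes is an (N, clip_dim) tensor of CLIP image embeddings (for example, the embeddings saved by the parsing script):

import torch
import torch.nn.functional as F

normed = F.normalize(prefixes.float(), dim=-1)   # `prefixes` is assumed to be loaded already
sim = normed[:200] @ normed[:200].T              # pairwise cosine similarities
off_diag = sim[~torch.eye(sim.shape[0], dtype=torch.bool)]
print("mean off-diagonal cosine similarity:", off_diag.mean().item())
# A mean very close to 1.0 means the prefixes are nearly identical, so the collapse
# happens upstream of the captioning model (e.g. feature extraction or the dataloader).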

As I mentioned, I was able to overfit a batch of 512 image-caption pairs and everything works out, so I don't think there is any logical issue with the pipeline. Still, I will check everything once more. Closing this issue! Please let me know if you find something useful!

goel-shashank • Feb 05 '22 01:02

Hi @goel-shashank, I found some logs for Conceptual Captions. This is with the ResNet-based CLIP: [screenshot of training logs]

rmokady • Feb 19 '22 14:02

This is with the ViT-based CLIP: [screenshot of training logs]

rmokady • Feb 19 '22 14:02

I have the same problem with my own dataset. It keeps generating similar captions...

ycchanau • Mar 26 '22 18:03

Hi, I have the same problem for Conceptual Captions + frozen model. Do you have loss values for that scenario? All the inputs end up converging to the same prefix. Thanks!

I followed the README and ran:

python parse_conceptual.py --clip_model_type ViT-B/32 --data_root /path/to/conceptual_captions --num_threads 100

and then

python train.py --only_prefix --data /path/to/conceptual_captions/conceptual_clip_ViT-B_32_train.pkl --out_dir /path/to/output_dir --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40

surisdi • Jul 12 '22 15:07
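A quick way to see whether two different prefixes really decode to the same string is a simplified greedy decoder (the repository has its own generation utilities; this sketch instead reuses the hypothetical mapper, gpt, tokenizer, prefix_length, and embed_dim names from the earlier sketch):

import torch

@torch.no_grad()
def greedy_caption(clip_embedding, max_len=20):
    # Map the CLIP embedding to prefix embeddings, then decode greedily token by token.
    embeds = mapper(clip_embedding.unsqueeze(0)).view(1, prefix_length, embed_dim)
    generated = []
    for _ in range(max_len):
        next_token = gpt(inputs_embeds=embeds).logits[:, -1].argmax(dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
        generated.append(next_token.item())
        embeds = torch.cat([embeds, gpt.transformer.wte(next_token).unsqueeze(1)], dim=1)
    return tokenizer.decode(generated)

# Two clearly different image embeddings should not produce the identical caption.
print(greedy_caption(prefix[0]))
print(greedy_caption(prefix[1]))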

@surisdi Did you manage to reproduce the results?

mmderakhshani • Dec 04 '22 17:12