ImageCaptioning.pytorch
Benchmarks
Cross entropy loss (CIDEr score on the validation set, no beam search, 25 epochs):

- fc: 0.92
- att2in: 0.95
- att2in2: 0.99
- topdown: 1.01

Self-critical training (the self-critical training code is in https://github.com/ruotianluo/self-critical.pytorch; self-critical applied after the 25 cross-entropy epochs; suggestion: don't start self-critical training too late):

- att2in: 1.12
- topdown: 1.12

Test split (beam size 5):

Cross entropy:

- topdown: CIDEr 1.07

Self-critical:

| Model   | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr |
|---------|--------|--------|--------|--------|--------|---------|-------|
| topdown | 0.779  | 0.615  | 0.467  | 0.347  | 0.269  | 0.561   | 1.143 |
| att2in2 | 0.777  | 0.613  | 0.465  | 0.347  | 0.267  | 0.560   | 1.156 |
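For anyone trying to reproduce the test-split numbers, an evaluation run along these lines should do it. The flag names are assumed to match this repo's eval.py/opts.py and the checkpoint paths are placeholders, so verify them locally:

```bash
# Placeholder paths; flag names assumed from eval.py/opts.py -- double-check them there.
python eval.py --model log_td/model-best.pth \
               --infos_path log_td/infos_td-best.pkl \
               --dump_images 0 --num_images -1 \
               --language_eval 1 --beam_size 5
```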
Is there any code or options showing how to train any of these models (topdown, etc.) with the self-critical algorithm? @ruotianluo
It's in my other repository.
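(For context, self-critical sequence training maximizes CIDEr with a policy gradient that uses the model's own greedy decode as the baseline. Below is a minimal sketch of that loss; the function and tensor names are illustrative and not the API of either repository.)

```python
import torch

def scst_loss(sample_logprobs, sample_mask, reward_sample, reward_greedy):
    """Self-critical policy-gradient loss (sketch, illustrative names only).

    sample_logprobs: (B, T) log-probabilities of the sampled caption tokens
    sample_mask:     (B, T) 1 for real tokens, 0 for padding
    reward_sample:   (B,)   CIDEr of the sampled captions
    reward_greedy:   (B,)   CIDEr of the greedy (test-time inference) captions
    """
    # Advantage: sampled score minus the greedy baseline ("self-critical").
    advantage = (reward_sample - reward_greedy).unsqueeze(1)
    loss = -sample_logprobs * advantage * sample_mask
    return loss.sum() / sample_mask.sum()
```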
Did you fine-tune the CNN when training the model with cross entropy loss?
No.
Wow, that's surprising. I cannot reach such a high score without fine-tuning when training my own captioning model under cross entropy loss. Most papers I have read fine-tune the CNN when training with cross entropy loss. Are there any tips for training the model with cross entropy?
Finetuning is actually worse. It's about how you extract the features; check the Self-critical Sequence Training paper.
I think they mean they did not do finetuning when training the model under the RL loss, but they did not mention whether they finetuned the CNN when training the model under cross entropy loss.
I finetuned the CNN under cross entropy loss as in neuraltalk2 (the Lua version) and got a CIDEr of 0.91 on the validation set without beam search. Then I trained the self-critical model without finetuning, starting from the best pretrained model, and I finally got a CIDEr almost matching the result in the self-critical paper.
They didn't fine-tune in either phase. And finetuning may not work as well with attention-based models.
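(To make the "extract the features, keep the CNN frozen" point concrete, here is a rough sketch of how fc/att features can be pulled from a fixed ResNet-101. This is an illustration, not the repo's prepro script; the input size and pooling are assumptions.)

```python
import torch
import torchvision

# Pretrained ResNet-101, kept frozen -- no finetuning in either training phase.
cnn = torchvision.models.resnet101(pretrained=True).eval()
for p in cnn.parameters():
    p.requires_grad = False

@torch.no_grad()
def extract_features(images):
    """images: (B, 3, 448, 448), ImageNet-normalized (448 assumed so the grid is 14x14)."""
    x = cnn.conv1(images)
    x = cnn.maxpool(cnn.relu(cnn.bn1(x)))
    x = cnn.layer4(cnn.layer3(cnn.layer2(cnn.layer1(x))))
    att_feats = x.permute(0, 2, 3, 1)   # (B, 14, 14, 2048) spatial 'att' features
    fc_feats = x.mean(dim=(2, 3))       # (B, 2048) average-pooled 'fc' feature
    return fc_feats, att_feats
```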
I did not train the attention-based model, but I will try. Thank you for your code; I will start learning PyTorch with it.
Dear @ruotianluo,
Thank you for your fantastic code. Would you please tell me all of the parameters you used when running train.py? (In fact, I used your code as described in the README, but when I tested the trained model, I got the same result, i.e. the same caption, for all of my different test images.) It is worth noting that I used --language_eval 0; maybe this wrong parameter caused these results, am I correct?
Can you try downloading the pretrained model and evaluating it on your test images? That would help me narrow down the problem.
Yes, I can download the pre-trained models and use them. The results from the pre-trained models were appropriate and nice; however, the results from the models I trained myself were the same for all of the images. It seems something is wrong with the parameters I used for training, since the trained model produces the same caption for every given image.
You should be able to reproduce my result following my instructions; it's really weird. Anyway, which options are not clear to you? (Most of the options are explained in opts.py.)
Thank you very much for your help. The problem has been solved. In fact, I had trained your code on another, synthetic data set, and that is where the error occurred. When I used your code on the MS-COCO data set, the training process had no problems. As another question, would you please kindly tell me appropriate values for the training parameters? I mean parameters such as beam_size, rnn_size, num_layers, rnn_type, learning_rate, learning_rate_decay_every, and scheduled_sampling_start.
@ahkarami Was the previous problem related to my code? I think the right values vary from dataset to dataset. Beam size could be 5. The numbers I set are the same as in the README.
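(For anyone looking for a concrete starting point, a cross-entropy training run in the README's style looks roughly like the following. The --id/--checkpoint_path values are placeholders, and the data-input flags in particular differ between versions of this repo, so confirm everything against opts.py and the README.)

```bash
# Sketch of a cross-entropy training run; placeholder id/paths, flags assumed
# from opts.py and the README -- verify before use.
python train.py --id td --caption_model topdown \
                --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5 \
                --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att \
                --batch_size 10 --learning_rate 5e-4 \
                --learning_rate_decay_start 0 --scheduled_sampling_start 0 \
                --checkpoint_path log_td --save_checkpoint_every 6000 \
                --val_images_use 5000 --language_eval 1 --max_epochs 25
```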
Dear @ruotianluo,
No, the previous problem was related to my data set; your code is correct. In fact, my data set contains many repeated words, and the length of the sentences varies from about 15 up to 90 words. I changed the parameters of prepro_labels.py to --max_length 50 and --word_count_threshold 2; after about 40 epochs the produced results are no longer the same for every given image, but they are still bad and not appropriate. I think my parameters for training and for pre-processing the labels are still not appropriate.
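(For reference, the label pre-processing step takes both of those options on the command line; something like the following, where the input/output paths are placeholders and the script path and remaining flag names are assumed from scripts/prepro_labels.py.)

```bash
# Placeholder paths; --max_length and --word_count_threshold are the options
# discussed above, the other flag names are assumed from scripts/prepro_labels.py.
python scripts/prepro_labels.py --input_json data/dataset_mine.json \
                                --output_json data/mytalk.json \
                                --output_h5 data/mytalk \
                                --max_length 50 --word_count_threshold 2
```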
Hi @ruotianluo, thank you for your code and benchmark. Did you test adaptive attention with your code? Could you share the adaptive attention results? Thank you again.
Actually no. I didn't spend much time on that model.
Thanks for your reply. Do you think the adaptive attention model is not good enough as a baseline?
It's good; I just couldn't get it to work well.
Could you clarify which features are used for the results above? ResNet-152? And does fc stand for ShowTell?
@dmitriy-serdyuk It's using ResNet-101, and FC stands for the FC model in the Self-critical Sequence Training paper, which can be regarded as a variant of ShowTell.
Thank you for your fantastic code. I am a beginner, and it helped me a lot. I have a question about the 'LSTMCore' class in FCModel.py. Why don't you use the official LSTM module and run it step by step, or the LSTMCell module with a dropout layer added on top? Is there any difference between your code and those?
The in gate is different. https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/master/models/FCModel.py#L34
OK, I got it. But why do you make this change? Is there any paper or any research about this?
Self-critical Sequence Training for Image Captioning https://arxiv.org/abs/1612.00563
Thank you very much!
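(For readers following along: the reason for a hand-written cell is that computing the gates explicitly makes it possible to change how one of them is formed, which nn.LSTM/nn.LSTMCell do not expose. Below is a plain, standard LSTM step written in that style, illustrative only; the actual variant used in FCModel.py is the one at the line linked above and described in the SCST paper.)

```python
import torch
import torch.nn as nn

class ManualLSTMStep(nn.Module):
    """A standard LSTM step with the gates written out by hand (illustration only)."""

    def __init__(self, input_size, rnn_size):
        super().__init__()
        self.i2h = nn.Linear(input_size, 4 * rnn_size)
        self.h2h = nn.Linear(rnn_size, 4 * rnn_size)

    def forward(self, xt, h_prev, c_prev):
        gates = self.i2h(xt) + self.h2h(h_prev)
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                 # the input transform a custom cell is free to redefine
        c_next = f * c_prev + i * g
        h_next = o * torch.tanh(c_next)
        return h_next, c_next
```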
I am wondering whether you used only the 80K training set to get such high performance on the validation set, or the 110K set. I am experimenting on the Karpathy split using only the 80K training set, but I get only about 0.72 CIDEr. If you used only the train set, can you give me some tips on training the network?
BTW, I am using the Show, Attend and Tell model for my experiments.