ImageCaptioning.pytorch
Benchmarks
Cross entropy loss (CIDEr score on the validation set, no beam search, 25 epochs):

- fc: 0.92
- att2in: 0.95
- att2in2: 0.99
- topdown: 1.01

Self-critical training (the self-critical training code is in https://github.com/ruotianluo/self-critical.pytorch; self-critical applied after the 25 cross-entropy epochs; suggestion: don't start self-critical training too late):

- att2in: 1.12
- topdown: 1.12

Test split (beam size 5):

Cross entropy:

- topdown: CIDEr 1.07

Self-critical:

| Model   | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr |
|---------|--------|--------|--------|--------|--------|---------|-------|
| topdown | 0.779  | 0.615  | 0.467  | 0.347  | 0.269  | 0.561   | 1.143 |
| att2in2 | 0.777  | 0.613  | 0.465  | 0.347  | 0.267  | 0.560   | 1.156 |
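For anyone trying to reproduce the test-split numbers, an evaluation run along these lines should do it. The flag names are assumed to match this repo's eval.py/opts.py and the checkpoint paths are placeholders, so verify them locally:

```bash
# Placeholder paths; flag names assumed from eval.py/opts.py -- double-check them there.
python eval.py --model log_td/model-best.pth \
               --infos_path log_td/infos_td-best.pkl \
               --dump_images 0 --num_images -1 \
               --language_eval 1 --beam_size 5
```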
Is there any code or options showing how to train any of these models (topdown, etc.) with the self-critical algorithm? @ruotianluo
It's in my other repository.
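(For context, self-critical sequence training maximizes CIDEr with a policy gradient that uses the model's own greedy decode as the baseline. Below is a minimal sketch of that loss; the function and tensor names are illustrative and not the API of either repository.)

```python
import torch

def scst_loss(sample_logprobs, sample_mask, reward_sample, reward_greedy):
    """Self-critical policy-gradient loss (sketch, illustrative names only).

    sample_logprobs: (B, T) log-probabilities of the sampled caption tokens
    sample_mask:     (B, T) 1 for real tokens, 0 for padding
    reward_sample:   (B,)   CIDEr of the sampled captions
    reward_greedy:   (B,)   CIDEr of the greedy (test-time inference) captions
    """
    # Advantage: sampled score minus the greedy baseline ("self-critical").
    advantage = (reward_sample - reward_greedy).unsqueeze(1)
    loss = -sample_logprobs * advantage * sample_mask
    return loss.sum() / sample_mask.sum()
```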
Did you fine-tune the CNN when training the model with cross entropy loss?
No.
Wow, that's surprising. I cannot reach such a high score without fine-tuning when training my own captioning model under cross entropy loss. Most papers I have read fine-tune the CNN when training with cross entropy loss. Are there any tips for training the model with cross entropy?
Finetuning is actually worse. It's about how you extract the features; check the Self-critical Sequence Training paper.
I think they mean they did not do finetuning when training the model under the RL loss, but they did not mention whether they finetuned the CNN when training the model under cross entropy loss.
I finetuned the CNN under cross entropy loss as in neuraltalk2 (the Lua version) and got a CIDEr of 0.91 on the validation set without beam search. Then I trained the self-critical model without finetuning, starting from the best pretrained model, and I finally got a CIDEr almost matching the result in the self-critical paper.
They didn't fine-tune in either phase. And finetuning may not work as well with attention-based models.
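(To make the "extract the features, keep the CNN frozen" point concrete, here is a rough sketch of how fc/att features can be pulled from a fixed ResNet-101. This is an illustration, not the repo's prepro script; the input size and pooling are assumptions.)

```python
import torch
import torchvision

# Pretrained ResNet-101, kept frozen -- no finetuning in either training phase.
cnn = torchvision.models.resnet101(pretrained=True).eval()
for p in cnn.parameters():
    p.requires_grad = False

@torch.no_grad()
def extract_features(images):
    """images: (B, 3, 448, 448), ImageNet-normalized (448 assumed so the grid is 14x14)."""
    x = cnn.conv1(images)
    x = cnn.maxpool(cnn.relu(cnn.bn1(x)))
    x = cnn.layer4(cnn.layer3(cnn.layer2(cnn.layer1(x))))
    att_feats = x.permute(0, 2, 3, 1)   # (B, 14, 14, 2048) spatial 'att' features
    fc_feats = x.mean(dim=(2, 3))       # (B, 2048) average-pooled 'fc' feature
    return fc_feats, att_feats
```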
I did not train the attention-based model, but I will try. Thank you for your code; I will start learning PyTorch with it.
Dear @ruotianluo,
Thank you for your fantastic code. Would you please tell me all of the parameters you used when running train.py? (In fact, I used your code as described in the README, but when I tested the trained model, I got the same result, i.e. the same caption, for all of my different test images.) It is worth noting that I used --language_eval 0; maybe this wrong parameter caused these results, am I correct?
Can you try downloading the pretrained model and evaluating it on your test images? That would help me narrow down the problem.
Yes, I can download the pre-trained models and use them. The results from the pre-trained models were appropriate and nice; however, the results from the models I trained myself were the same for all of the images. It seems something is wrong with the parameters I used for training, since the trained model produces the same caption for every given image.
You should be able to reproduce my result following my instructions; it's really weird. Anyway, which options are not clear to you? (Most of the options are explained in opts.py.)
Thank you very much for your help. The problem has been solved. In fact, I had trained your code on another, synthetic data set, and that is where the error occurred. When I used your code on the MS-COCO data set, the training process had no problems. As another question, would you please kindly tell me appropriate values for the training parameters? I mean parameters such as beam_size, rnn_size, num_layers, rnn_type, learning_rate, learning_rate_decay_every, and scheduled_sampling_start.
@ahkarami Was the previous problem related to my code? I think the right values vary from dataset to dataset. Beam size could be 5. The numbers I set are the same as in the README.
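(For anyone looking for a concrete starting point, a cross-entropy training run in the README's style looks roughly like the following. The --id/--checkpoint_path values are placeholders, and the data-input flags in particular differ between versions of this repo, so confirm everything against opts.py and the README.)

```bash
# Sketch of a cross-entropy training run; placeholder id/paths, flags assumed
# from opts.py and the README -- verify before use.
python train.py --id td --caption_model topdown \
                --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5 \
                --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att \
                --batch_size 10 --learning_rate 5e-4 \
                --learning_rate_decay_start 0 --scheduled_sampling_start 0 \
                --checkpoint_path log_td --save_checkpoint_every 6000 \
                --val_images_use 5000 --language_eval 1 --max_epochs 25
```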
Dear @ruotianluo,
No, the previous problem was related to my data set; your code is correct. In fact, my data set contains many repeated words, and the length of the sentences varies from about 15 up to 90 words. I changed the parameters of prepro_labels.py to --max_length 50 and --word_count_threshold 2; after about 40 epochs the produced results are no longer the same for every given image, but they are still bad and not appropriate. I think my parameters for training and for pre-processing the labels are still not appropriate.
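(For reference, the label pre-processing step takes both of those options on the command line; something like the following, where the input/output paths are placeholders and the script path and remaining flag names are assumed from scripts/prepro_labels.py.)

```bash
# Placeholder paths; --max_length and --word_count_threshold are the options
# discussed above, the other flag names are assumed from scripts/prepro_labels.py.
python scripts/prepro_labels.py --input_json data/dataset_mine.json \
                                --output_json data/mytalk.json \
                                --output_h5 data/mytalk \
                                --max_length 50 --word_count_threshold 2
```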
Hi @ruotianluo, thank you for your code and benchmark. Did you test adaptive attention with your code? Could you share the adaptive attention results? Thank you again.
Actually no. I didn't spend much time on that model.
Thanks for your reply. Do you think the adaptive attention model is not good enough as a baseline?
It's good; I just couldn't get it to work well.
Could you clarify which features are used for the results above? ResNet-152? And does fc stand for ShowTell?
@dmitriy-serdyuk It's using ResNet-101, and FC stands for the FC model in the Self-critical Sequence Training paper, which can be regarded as a variant of ShowTell.
Thank you for your fantastic code. I am a beginner, and it helped me a lot. I have a question about the 'LSTMCore' class in FCModel.py. Why don't you use the official LSTM module and run it step by step, or the LSTMCell module with a dropout layer added on top? Is there any difference between your code and those?
The in gate is different. https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/master/models/FCModel.py#L34
OK, I got it. But why do you make this change? Is there any paper or any research about this?
Self-critical Sequence Training for Image Captioning https://arxiv.org/abs/1612.00563
Thank you very much!
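(For readers following along: the reason for a hand-written cell is that computing the gates explicitly makes it possible to change how one of them is formed, which nn.LSTM/nn.LSTMCell do not expose. Below is a plain, standard LSTM step written in that style, illustrative only; the actual variant used in FCModel.py is the one at the line linked above and described in the SCST paper.)

```python
import torch
import torch.nn as nn

class ManualLSTMStep(nn.Module):
    """A standard LSTM step with the gates written out by hand (illustration only)."""

    def __init__(self, input_size, rnn_size):
        super().__init__()
        self.i2h = nn.Linear(input_size, 4 * rnn_size)
        self.h2h = nn.Linear(rnn_size, 4 * rnn_size)

    def forward(self, xt, h_prev, c_prev):
        gates = self.i2h(xt) + self.h2h(h_prev)
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                 # the input transform a custom cell is free to redefine
        c_next = f * c_prev + i * g
        h_next = o * torch.tanh(c_next)
        return h_next, c_next
```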
I am wondering whether you used only the 80K training set to get such high performance on the validation set, or the 110K set. I am experimenting on the Karpathy split using only the 80K training set, but I get only about 0.72 CIDEr. If you used only the train set, can you give me some tips on training the network?
BTW, I am using the Show, Attend and Tell model for my experiments.