show-attend-and-tell

Beam search

Open wjb123 opened this issue 7 years ago • 7 comments

Hi, I am reading your excellent code, but I find no beam search during caption generation, unlike the original source code at https://github.com/kelvinxu/arctic-captions. Is there any reason?

wjb123 avatar Jun 22 '17 06:06 wjb123

@wjb123 Yes, I noticed that too. But is this the only difference between this code and the original? Have you found other differences? After running this code, I can't reproduce results as good as those in the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention".

MenSanYan avatar Jun 23 '17 00:06 MenSanYan

@MenSanYan A difference in score? I believe the lack of beam search would definitely lead to worse results.

Do either of you perhaps know why there's the difference I noted in #40?

rubenvereecken avatar Jun 23 '17 12:06 rubenvereecken

@rubenvereecken @MenSanYan Do you have beam search code for this implementation? Thanks!

lvyongqiang4644 avatar Dec 04 '17 08:12 lvyongqiang4644

I'm afraid I never actually needed the beam search code as I was not working on ntm. I'm sure there is a TensorFlow implementation out there somewhere.
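For anyone who wants to bolt beam search onto a decoder, here is a minimal, framework-agnostic sketch. It is not taken from this repository or from arctic-captions. It assumes a hypothetical `step_fn` that maps a partial token sequence to next-token log-probabilities; in a real captioning model that function would wrap one LSTM step plus the attention context. The toy model (`toy_step`) and its probabilities are made up purely to show a case where greedy decoding picks the worse caption. Scores are raw summed log-probabilities with no length normalization, which is a simplification.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[float, List[int]]  # (summed log-probability, token ids)


def beam_search(
    step_fn: Callable[[Sequence[int]], Sequence[float]],
    start_id: int,
    end_id: int,
    beam_size: int = 3,
    max_len: int = 20,
) -> List[int]:
    """Return the best-scoring sequence found with a beam of `beam_size`.

    `step_fn` is a hypothetical callback: given the tokens so far, it
    returns one log-probability per vocabulary entry for the next token.
    """
    beams: List[Hypothesis] = [(0.0, [start_id])]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for score, seq in beams:
            for token, log_prob in enumerate(step_fn(seq)):
                candidates.append((score + log_prob, seq + [token]))
        # Keep only the `beam_size` best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == end_id:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that hit max_len without ending
    return max(finished, key=lambda c: c[0])[1]


def toy_step(seq: Sequence[int]) -> List[float]:
    """Made-up 4-token model: 0 = <start>, 1 = <end>, 2 = 'a', 3 = 'b'."""
    table = {
        (0,): [0.0, 0.0, 0.6, 0.4],
        (0, 2): [0.0, 0.4, 0.3, 0.3],
        (0, 3): [0.0, 0.9, 0.05, 0.05],
    }
    probs = table.get(tuple(seq), [0.0, 1.0, 0.0, 0.0])
    return [math.log(p) if p > 0 else float("-inf") for p in probs]


greedy = beam_search(toy_step, start_id=0, end_id=1, beam_size=1)
beamed = beam_search(toy_step, start_id=0, end_id=1, beam_size=2)
```

With `beam_size=1` this reduces to greedy decoding and commits to token 2 because it looks best locally. With `beam_size=2` the search keeps the second-best first token alive and ends up with a sequence of higher total probability.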

rubenvereecken avatar Dec 04 '17 10:12 rubenvereecken

I have read that beam search gives a boost in BLEU-4 of around 10%. evaluate_model.ipynb shows a BLEU-4 of 21.1, whereas the paper reports 24.3, so that might be the reason for the difference.
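As a quick sanity check on those numbers, the arithmetic below uses only the two scores quoted above. It assumes the "around 10%" figure means a relative improvement, which is my reading and not something stated in the source I read.

```python
bleu4_notebook = 21.1  # BLEU-4 shown in evaluate_model.ipynb
bleu4_paper = 24.3     # BLEU-4 reported in the paper

absolute_gap = bleu4_paper - bleu4_notebook    # gap in BLEU points
relative_gap = absolute_gap / bleu4_notebook   # gap relative to the notebook score

# Assumption: a 10% relative boost from beam search.
bleu4_with_boost = bleu4_notebook * 1.10
remaining_gap = bleu4_paper - bleu4_with_boost

print(f"absolute gap: {absolute_gap:.1f} points")
print(f"relative gap: {relative_gap:.1%}")
print(f"with a 10% boost: {bleu4_with_boost:.2f}, still short by {remaining_gap:.2f}")
```

Under that assumption the gap is about 15% relative, so beam search alone would explain most of the difference but not all of it.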

  1. Were you able to train this model and get a BLEU-4 score of 21.1? I am implementing the paper in PyTorch and was unable to reach a good BLEU score.

  2. I found this implementation and am mystified by the magical T/L in the loss (as you also asked in https://github.com/yunjey/show-attend-and-tell/issues/40).

  3. The other difference I noticed is that this implementation uses the conv5_3 layer of VGG19. The paper says "In our experiments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling.", which would correspond to some other layer.

nishant-puri avatar Dec 04 '17 18:12 nishant-puri

This is my best score:

| B-1 | B-2 | B-3 | B-4 | METEOR |
|------|------|------|------|--------|
| 67.2 | 46.3 | 31.9 | 22.4 | 22.0 |

lvyongqiang4644 avatar Dec 06 '17 02:12 lvyongqiang4644

@nishant-puri I am also confused by it. The feature map in conv5_3 is actually 14×14×512: the original image size is 224, and after 4 max-pooling layers the spatial size becomes 224/2/2/2/2 = 14, so that should be correct. Also, the author used conv5_4 features (see https://github.com/kelvinxu/arctic-captions/issues/1).
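The size arithmetic can be checked with a few lines. This is only a sketch of the calculation, not code from either repository. It relies on the fact that VGG's 3×3 convolutions use padding 1 and so keep the spatial size, which means only the 2×2 stride-2 max-pooling layers shrink the map. The helper name `spatial_size` is made up for illustration.

```python
def spatial_size(input_size: int, num_pools: int, stride: int = 2) -> int:
    """Side length of the feature map after `num_pools` stride-`stride` poolings."""
    size = input_size
    for _ in range(num_pools):
        size //= stride
    return size


# conv5_x layers sit after the 4th max-pooling layer and before the 5th.
before_pool5 = spatial_size(224, num_pools=4)
after_pool5 = spatial_size(224, num_pools=5)
print(before_pool5, after_pool5)
```

So any conv5 layer (conv5_3 or conv5_4) yields a 14×14 map with 512 channels for a 224×224 input, while taking features after the fifth pooling would give 7×7 instead.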

jamiechoi1995 avatar Aug 14 '18 13:08 jamiechoi1995