vilbert_beta
image captioning with ViLBERT
Figure 5 in the paper shows samples of generated image descriptions, but I couldn't reproduce similar results using the pretrained ViLBERT. I used BertForMultiModalPreTraining and supplied image features that appear to be correct, given that prediction_scores_v (the hv vector in the paper) seems to reflect what is in the picture. As the "question" (the text stream), I supplied a tensor of 30 [MASK] tokens. Then, following the paper, I passed it through the model 30 times, at each iteration setting the i-th token of the text stream to the token with the highest score at the i-th position. I also tried repeating the whole procedure multiple times, but that didn't change much. This results in very poor captions, such as "the a man is a man who is a man who is a man ...".
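For reference, here is a minimal sketch of the decoding loop I am describing. The `score_fn` callable stands in for a ViLBERT forward pass on the text stream; the names (`iterative_mask_decode`, `toy_score_fn`, `MASK_ID`) are illustrative placeholders, not the real vilbert_beta API:

```python
# Sketch of the left-to-right mask-filling decode described above.
# `score_fn` is a stand-in for the model: given the current token ids,
# it returns per-position scores over the vocabulary.

MASK_ID = 0  # placeholder id for the [MASK] token

def iterative_mask_decode(score_fn, seq_len):
    """Start from an all-[MASK] sequence; at step i, run the model and
    fix position i to the argmax of the scores at position i, so later
    predictions condition on earlier choices."""
    tokens = [MASK_ID] * seq_len
    for i in range(seq_len):
        scores = score_fn(tokens)  # scores[pos][vocab_id]
        tokens[i] = max(range(len(scores[i])), key=lambda v: scores[i][v])
    return tokens

# Toy scorer that always prefers token id (pos + 1), just to exercise
# the loop; in my experiments this role is played by the ViLBERT model.
def toy_score_fn(tokens):
    vocab = 8
    return [[1.0 if v == (pos + 1) % vocab else 0.0 for v in range(vocab)]
            for pos in range(len(tokens))]

print(iterative_mask_decode(toy_score_fn, 5))  # -> [1, 2, 3, 4, 5]
```

The loop itself behaves as expected on toy inputs, so my question is whether the model should be queried differently (e.g. different masking or re-masking of later positions between steps).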
Could you please elaborate on the captioning method presented in the paper?