
image captioning with ViLBERT


Figure 5 in the paper shows samples of generated image descriptions, but I couldn't reproduce similar results with the pretrained ViLBERT. I used BertForMultiModalPreTraining and supplied image region features that appear to be fine, since prediction_scores_v (the hv vector in the paper) seems to reflect what is in the picture. As the "question", I supplied a tensor of 30 [MASK] tokens. Following the paper, I then passed it through the model 30 times, and at iteration i set the i-th token of the "question" (text stream) to the text token with the highest score at position i (a sketch of this loop is included below). I also tried repeating the whole procedure several times, but it didn't change much. The result is very poor captions, such as "the a man is a man who is a man who is a man ...".
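
For reference, here is a minimal sketch of the decoding loop I described. It assumes the pretrained BertForMultiModalPreTraining, its BERT tokenizer, and precomputed image region features/boxes are already loaded; the forward call's argument names and the order of its returned scores are illustrative and may not match the actual vilbert_beta API.

```python
import torch

# Assumed to be loaded beforehand (not shown):
#   model       : BertForMultiModalPreTraining (pretrained weights)
#   tokenizer   : matching BERT tokenizer
#   image_feats : (1, num_regions, feat_dim) region features
#   image_locs  : (1, num_regions, 5) normalized box coordinates
# NOTE: the forward call below and the order of its return values are
# assumptions for illustration; adapt them to the actual vilbert_beta code.

MAX_LEN = 30
mask_id = tokenizer.convert_tokens_to_ids(["[MASK]"])[0]

# Start the text stream as 30 [MASK] tokens.
input_ids = torch.full((1, MAX_LEN), mask_id, dtype=torch.long)

model.eval()
for i in range(MAX_LEN):
    with torch.no_grad():
        prediction_scores_t, prediction_scores_v = model(
            input_ids, image_feats, image_locs
        )[:2]
    # Commit the i-th position to its highest-scoring token; later
    # positions stay [MASK] until their own iteration.
    input_ids[0, i] = prediction_scores_t[0, i].argmax()

caption_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
print(" ".join(caption_tokens))
```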

Could you please elaborate on the captioning method you've presented in the publication?

szalata · Oct 03 '19 13:10