Audio-Visual-Video-Caption
A PyTorch implementation of an audio-visual fusion video captioning model.
Hi, could you tell me how you split the MSR-VTT dataset? Many thanks!
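(Not from the author, but for reference: the commonly used standard MSR-VTT split assigns clips by video id, with 6,513 for training, 497 for validation, and 2,990 for testing. A minimal sketch of that split follows; the helper name and id format `videoNNNN` are assumptions based on the usual MSR-VTT naming.)

```python
# Sketch of the standard MSR-VTT split by video id:
#   train: video0    .. video6512  (6513 clips)
#   val:   video6513 .. video7009  (497 clips)
#   test:  video7010 .. video9999  (2990 clips)
def msrvtt_split(video_id: str) -> str:
    """Return 'train', 'val', or 'test' for an id like 'video1234'."""
    idx = int(video_id.replace('video', ''))
    if idx <= 6512:
        return 'train'
    if idx <= 7009:
        return 'val'
    return 'test'

assert msrvtt_split('video0') == 'train'
assert msrvtt_split('video6513') == 'val'
assert msrvtt_split('video9999') == 'test'
```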
Hi, why are the multi-level attentions used during encoding? According to the multimodal attention paper, they are used only during decoding.
Hi, the dataset isn't available at the links you mentioned in an earlier issue. Could you please advise?
Hi, I would like to try your video captioning model on my own videos; could you please provide the pre-trained model?
The call `n_layers=opt['num_layers'], rnn_cell=opt['rnn_type'], rnn_dropout_p=opt['rnn_dropout_p']).cuda()` fails with `KeyError: 'rnn_type'`.
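(A minimal workaround sketch, assuming `opt` is a plain options dict that simply lacks the `rnn_type` key the model constructor expects; the fallback values below are assumptions, not values from this repo, so use whatever the checkpoint was actually trained with if you know it.)

```python
# Backfill option keys the model constructor expects before building the model.
opt = {'num_layers': 1}  # example of an opt dict missing 'rnn_type'

defaults = {
    'rnn_type': 'gru',      # assumed default; the repo may expect 'lstm'
    'rnn_dropout_p': 0.5,   # assumed default dropout
}
for key, value in defaults.items():
    opt.setdefault(key, value)

print(opt['rnn_type'])  # no longer raises KeyError: 'rnn_type'
```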
Thanks for your work. Could you provide the S2VT model?