VG-GPLMs
`image_len` is not used?
https://github.com/HLTCHKUST/VG-GPLMs/blob/ecd40e8d884123666a339ef9d2968b178610b898/src/models/modeling_bart.py#L749
`image_len` is not used in the attention calculation?
`image_len=None` means the default value is None; you can pass an int list of length batch size to this function.
I mean that `image_len` is not used in the attention calculation (as a mask).
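To illustrate what using `image_len` as a mask could look like, here is a minimal sketch (not the repo's actual code; the shapes and names are assumptions based on the discussion): padded image positions are set to `-inf` before the softmax so they receive zero attention weight.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes, following the thread:
# attn is [batch_size, text_len, image_len_max]
batch_size, text_len, image_len_max = 2, 4, 5
attn = torch.randn(batch_size, text_len, image_len_max)

# image_len: the true number of image features per sample
image_len = torch.tensor([5, 3])

# Boolean mask: True where an image position is padding
positions = torch.arange(image_len_max)               # [image_len_max]
pad_mask = positions[None, :] >= image_len[:, None]   # [batch, image_len_max]

# Broadcast over the text dimension and mask out padded positions
attn = attn.masked_fill(pad_mask[:, None, :], float("-inf"))
weights = F.softmax(attn, dim=2)

# Padded positions now get exactly zero attention weight
print(weights[1, :, 3:].abs().sum().item())  # 0.0
```

This is only a sketch of the masking idea the thread suggests, not a drop-in patch for `modeling_bart.py`.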
And is there an error in the attn softmax dim? https://github.com/HLTCHKUST/VG-GPLMs/blob/ecd40e8d884123666a339ef9d2968b178610b898/src/models/modeling_bart.py#L882
The attn shape is [batch_size (0), text_len (1), image_len (2)], so the softmax should be over the image_len dim (2). I think the softmax dim should be 2, not 1?
Is there something wrong with my thinking?
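A small sketch of the dimension question (the shape `[batch, text_len, image_len]` is taken from the thread, the tensors are made up): with `dim=2` each text token's weights sum to 1 over the image positions, while `dim=1` normalizes over text positions instead.

```python
import torch
import torch.nn.functional as F

# attn: [batch_size, text_len, image_len], as described in the thread
attn = torch.randn(2, 4, 5)

w2 = F.softmax(attn, dim=2)  # normalizes over image positions
w1 = F.softmax(attn, dim=1)  # normalizes over text positions instead

print(w2.sum(dim=2))  # all ones: each text token distributes over images
print(w1.sum(dim=1))  # all ones over the text dim: a different normalization
```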
I see. `image_len` is not used in the multimodal fusion function. You could apply it as a mask in the cross-attention; it would probably improve performance slightly.
https://github.com/HLTCHKUST/VG-GPLMs/blob/ecd40e8d884123666a339ef9d2968b178610b898/src/models/modeling_bart.py#L882
I think L882 should be (for the reason above):
attn = F.softmax(attn, dim=2)
am I wrong?