ImageCaptioning.pytorch
Training problems with BertCapModel
The BertCapModel uses two BertModels as encoder and decoder. However, the BertConfig for the decoder sets max_position_embeddings = 17, which leads to a tensor size mismatch like this:
File "captioning/captioning/models/BertCapModel.py", line 61, in decode
encoder_attention_mask=src_mask)[0]
File "./lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "./lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 862, in forward
input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
File "./lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "./lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 204, in forward
embeddings += position_embeddings
RuntimeError: The size of tensor a (200) must match the size of tensor b (17) at non-singleton dimension
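For reference, here is a minimal sketch (not from the repo; it assumes a transformers 4.x install like the one in the traceback) that reproduces this kind of mismatch. BertEmbeddings adds a learned position embedding with at most max_position_embeddings positions, so any input longer than that limit breaks the addition at the line shown above.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical config values; only max_position_embeddings matters for the error.
config = BertConfig(
    vocab_size=1000,
    hidden_size=512,
    num_hidden_layers=1,
    num_attention_heads=8,
    intermediate_size=512,
    max_position_embeddings=17,   # the same limit used in the BertCapModel decoder config
)
model = BertModel(config)

# A 200-token input exceeds the 17 available positions, so the
# `embeddings += position_embeddings` addition in BertEmbeddings fails.
input_ids = torch.randint(0, 1000, (1, 200))
model(input_ids=input_ids)   # RuntimeError: size of tensor a (200) vs b (17)
```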
From this error, it seems that seq_length must not exceed max_position_embeddings. However, when I change max_position_embeddings to match seq_length, another problem appears:
AssertionError: If `encoder_hidden_states` are passed, BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=512, out_features=512, bias=True)
(key): Linear(in_features=512, out_features=512, bias=True)
(value): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=512, out_features=512, bias=True)
(LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=512, out_features=512, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=512, out_features=512, bias=True)
(LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
) has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`
Finally, to check that my test is correct, I would also like to know the expected input shape for BertCapModel.
Try setting config.add_cross_attention=True; I think it is a newer option, so I did not include it.
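For what it's worth, below is a rough sketch of a decoder config with that flag set. The hidden_size=512 and intermediate_size=512 mirror the module dump above, while the vocabulary size, layer count, and max_position_embeddings are assumptions rather than the repo's exact values. With is_decoder=True and add_cross_attention=True, BertModel builds the cross-attention layers the assertion above asks for.

```python
from transformers import BertConfig, BertModel

tgt_vocab_size = 10000   # placeholder: use the caption vocabulary size

dec_config = BertConfig(
    vocab_size=tgt_vocab_size,
    hidden_size=512,
    num_hidden_layers=6,           # assumed depth
    num_attention_heads=8,
    intermediate_size=512,
    max_position_embeddings=64,    # comfortably above the longest caption length
    is_decoder=True,               # causal self-attention
    add_cross_attention=True,      # builds cross-attention over encoder_hidden_states
)
decoder = BertModel(dec_config)
```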
Also, I want to know what the input seq_length is. And how should I set the config of BertCapModel?
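One way to answer the seq_length question empirically is the debugging sketch below (the hook target is an assumption about whichever BertModel instance your BertCapModel uses as its decoder): a forward hook on the decoder's embedding layer prints the sequence length that actually reaches the position-embedding addition.

```python
import torch
from transformers import BertConfig, BertModel

def log_seq_len(module, inputs, output):
    # output: (batch, seq_len, hidden) after token/position/type embeddings are summed
    print('decoder seq_len =', output.shape[1])

# Toy decoder just to demonstrate the hook; attach the same hook to the decoder
# inside your BertCapModel instance to see the real lengths during training.
decoder = BertModel(BertConfig(vocab_size=100, hidden_size=64,
                               num_hidden_layers=1, num_attention_heads=4,
                               intermediate_size=64, max_position_embeddings=32,
                               is_decoder=True, add_cross_attention=True))
decoder.embeddings.register_forward_hook(log_seq_len)
decoder(input_ids=torch.randint(0, 100, (2, 16)))   # prints: decoder seq_len = 16
```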
Should I change BertConfig.max_position_embeddings?
You can try removing the max positional embedding and see if it works.
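One way to read that suggestion is sketched below (an assumption about the intent, not the author's exact fix): keep max_position_embeddings large enough that captions never exceed the table, then zero out and freeze the learned position embedding so it effectively contributes nothing.

```python
import torch
from transformers import BertConfig, BertModel

dec_config = BertConfig(
    vocab_size=10000,              # placeholder vocabulary size
    hidden_size=512,
    num_hidden_layers=6,           # assumed depth
    num_attention_heads=8,
    intermediate_size=512,
    max_position_embeddings=512,   # generous upper bound instead of 17
    is_decoder=True,
    add_cross_attention=True,
)
decoder = BertModel(dec_config)

# Effectively "remove" the positional embedding: zero the table and stop training it.
with torch.no_grad():
    decoder.embeddings.position_embeddings.weight.zero_()
decoder.embeddings.position_embeddings.weight.requires_grad = False
```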