
Training problems with BertCapModel

Open bugczw opened this issue 3 years ago • 4 comments

BertCapModel uses two BertModel instances, one as the encoder and one as the decoder. However, the BertModel config for the decoder sets max_position_embeddings = 17, which leads to a tensor size mismatch like this:

 File "captioning/captioning/models/BertCapModel.py", line 61, in decode
    encoder_attention_mask=src_mask)[0]
  File "./lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 862, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "./lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 204, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (200) must match the size of tensor b (17) at non-singleton dimension 
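
For reference, this failure can be reproduced outside the captioning code with a plain BertModel whose max_position_embeddings is smaller than the input length (a minimal sketch, assuming transformers 4.x; the sizes below are illustrative, not the repo's exact config):

import torch
from transformers import BertConfig, BertModel

# A BertModel whose max_position_embeddings (17) is smaller than the
# input length (200) fails at the position-embedding addition, exactly
# as in the traceback above.
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=2,
    num_attention_heads=8,
    intermediate_size=512,
    max_position_embeddings=17,
)
model = BertModel(config)

feats = torch.randn(1, 200, 512)   # 200 feature vectors of width 512
out = model(inputs_embeds=feats)   # RuntimeError: tensor a (200) vs tensor b (17)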

This error means that seq_length has to match max_position_embeddings. However, after changing max_position_embeddings to seq_length, another problem appears:

AssertionError: If `encoder_hidden_states` are passed, BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=512, out_features=512, bias=True)
      (key): Linear(in_features=512, out_features=512, bias=True)
      (value): Linear(in_features=512, out_features=512, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=512, out_features=512, bias=True)
      (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=512, out_features=512, bias=True)
  )
  (output): BertOutput(
    (dense): Linear(in_features=512, out_features=512, bias=True)
    (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
) has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`
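
If it helps, this second error is the standard transformers check that a decoder consuming encoder_hidden_states must be built with cross-attention layers. A minimal sketch of how the decoder config could be constructed (assuming transformers 4.x; the sizes are illustrative, not necessarily the repo's defaults):

from transformers import BertConfig, BertModel

decoder_config = BertConfig(
    vocab_size=10000,               # illustrative caption vocabulary size
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=512,
    max_position_embeddings=200,    # must cover the longest sequence fed in
    is_decoder=True,                # causal self-attention for decoding
    add_cross_attention=True,       # adds the cross-attention layers the
)                                   # AssertionError above asks for
decoder = BertModel(decoder_config)

With both is_decoder=True and add_cross_attention=True, the decoder can be called with encoder_hidden_states and encoder_attention_mask, as in BertCapModel.decode.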

Finally, to check that my test is correct, I would also like to know the expected input shapes for BertCapModel.

bugczw · Jan 25 '21, 05:01