
HF transformers caching causing errors

Open · dwlmt opened this issue on Jul 12, 2022 · 2 comments

I'm trying the latest updated code for the HF transformers integration on an AWS SageMaker 1.10 GPU instance. The following config change was made a few days ago for the respective models: https://huggingface.co/OFA-Sys/OFA-large/commit/d41e09bed9fcec4fd3a1f7d7a7bd5839043c87c3#d2h-025836. With caching enabled, the error below occurs, which has to do with a tensor being the wrong shape. When the config is manually changed so that only the caching value is disabled, generation still works.

Error:

/opt/conda/lib/python3.8/site-packages/transformers/models/ofa/modeling_ofa.py in forward(self, hidden_states, attention_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions, use_cache, self_attn_bias, cross_attn_bias)
    604         self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
    605         # add present self-attn cache to position 1,2 of present_key_value tuple
--> 606         hidden_states, self_attn_weights, present_key_value = self.self_attn(
    607             hidden_states=hidden_states,
    608             past_key_value=self_attn_past_key_value,

/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1129         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1130                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1131             return forward_call(*input, **kwargs)
   1132         # Do not call functions when jit is used
   1133         full_backward_hooks, non_full_backward_hooks = [], []

/opt/conda/lib/python3.8/site-packages/transformers/models/ofa/modeling_ofa.py in forward(self, hidden_states, key_value_states, past_key_value, attention_mask, output_attentions, attn_bias)
    386         if attention_mask is not None:
    387             if attention_mask.size() != (bsz, 1, tgt_len, src_len):
--> 388                 raise ValueError(
    389                     f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
    390                 )

ValueError: Attention mask should be of size (25, 1, 1, 12), but is torch.Size([25, 1, 1, 1])
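For reference, a minimal sketch of the workaround described above. The class names and generate kwargs follow the OFA fork's transformers README and are assumptions on my side; the point is simply that overriding use_cache at load time lets the stock HF generator run.

import torch
from transformers import OFATokenizer, OFAModel  # classes provided by the OFA fork of transformers

ckpt_dir = "OFA-Sys/OFA-large"

tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
# Passing use_cache=False overrides the value switched on in the checkpoint's
# config.json, so incremental decoding no longer hits the attention-mask
# shape error shown in the traceback above.
model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)

txt = " what does the image describe?"
inputs = tokenizer([txt], return_tensors="pt").input_ids

# Dummy 480x480 image tensor just to keep the sketch self-contained;
# real usage would resize/normalize an actual image as in transformers.md.
patch_img = torch.zeros(1, 3, 480, 480)

gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))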

dwlmt · Jul 12 '22 17:07

Would you mind checking the update in this readme: https://github.com/OFA-Sys/OFA/blob/feature/add_transformers/transformers.md ? Sorry about that, I guess it might be because you are using the old generator provided by HF. We find that its generation still differs slightly from the Fairseq generator, which might cause a slight performance degradation. Thus, in the showcase we switched the sequence generator to the original Fairseq one and ultimately found almost no difference.

Anyway, many thanks for reporting the problem to us. I'll change use_cache to False by default to get rid of the problem temporarily.
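For context, a rough sketch of the Fairseq-style generator usage that the linked transformers.md describes. The module path, class name, and arguments are reproduced from memory of that README and should be treated as assumptions, not as the exact API.

import torch
from transformers import OFATokenizer, OFAModel
from transformers.models.ofa.generate import sequence_generator  # shipped with the OFA fork

ckpt_dir = "OFA-Sys/OFA-large"
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
model = OFAModel.from_pretrained(ckpt_dir)

# Fairseq-style generator bundled with the fork; it follows the original
# OFA decoding more closely than the stock HF generate().
generator = sequence_generator.SequenceGenerator(
    tokenizer=tokenizer,
    beam_size=5,
    max_len_b=16,
    min_len=0,
    no_repeat_ngram_size=3,
)

inputs = tokenizer([" what does the image describe?"], return_tensors="pt").input_ids
patch_img = torch.zeros(1, 3, 480, 480)  # placeholder image tensor

data = {"net_input": {
    "input_ids": inputs,
    "patch_images": patch_img,
    "patch_masks": torch.tensor([True]),
}}
gen_output = generator.generate([model], data)
gen = [gen_output[i][0]["tokens"] for i in range(len(gen_output))]
print(tokenizer.batch_decode(gen, skip_special_tokens=True))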

JustinLin610 · Jul 13 '22 16:07

Yes, I was using the old version. I'll look at the Fairseq one, but I want to use typical_p and constrained beam search from HF, so I'll need to check whether they are supported in Fairseq.
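For anyone following along, both of those decoding options live on the stock HF generate(); a hedged sketch combining them with the use_cache=False workaround from above. Whether the fork's transformers version already exposes typical_p and force_words_ids, and whether OFA's generate accepts patch_images this way, are assumptions here.

import torch
from transformers import OFATokenizer, OFAModel

ckpt_dir = "OFA-Sys/OFA-large"
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)  # workaround from this thread

inputs = tokenizer([" what does the image describe?"], return_tensors="pt").input_ids
patch_img = torch.zeros(1, 3, 480, 480)  # placeholder image tensor

# Typical decoding (requires sampling).
sampled = model.generate(inputs, patch_images=patch_img,
                         do_sample=True, typical_p=0.9, max_length=16)

# Constrained beam search: force a phrase to appear in the output.
force_words_ids = tokenizer([" dog"], add_special_tokens=False).input_ids
constrained = model.generate(inputs, patch_images=patch_img, num_beams=5,
                             force_words_ids=force_words_ids, max_length=16)

print(tokenizer.batch_decode(sampled, skip_special_tokens=True))
print(tokenizer.batch_decode(constrained, skip_special_tokens=True))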

dwlmt · Jul 14 '22 08:07