TransformerEncoderLayer change leads to different behavior on legacy code
A heads-up for anyone who might be using the Transformer encoder. The TransformerEncoderLayer forward pass used to be:
```python
tensor = tensor + self.dropout(self.attention(tensor, mask=mask))
tensor = _normalize(tensor, self.norm1)
tensor = tensor + self.dropout(self.ffn(tensor))
tensor = _normalize(tensor, self.norm2)
tensor *= mask.unsqueeze(-1).type_as(tensor)
return tensor
```
Meanwhile, the current version in master is here: https://github.com/facebookresearch/ParlAI/blob/a3c4c3ef68c616e1bb6cc0796313356117a6bcd4/parlai/agents/transformer/modules.py#L588. Since the default variant is None, the default forward behavior has changed from what it was when no variant is given: https://github.com/facebookresearch/ParlAI/blob/a3c4c3ef68c616e1bb6cc0796313356117a6bcd4/parlai/agents/transformer/modules.py#L602
```python
residual = tensor
if self.variant == 'prelayernorm':
    tensor = _normalize(tensor, self.norm1)
attended_tensor, _ = self.attention(tensor, mask=mask)
tensor = residual + self.dropout(attended_tensor)
if self.variant == 'aiayn' or self.variant == 'xlm':
    tensor = _normalize(tensor, self.norm1)
residual = tensor
if self.variant == 'prelayernorm':
    tensor = _normalize(tensor, self.norm2)
tensor = residual + self.dropout(self.ffn(tensor))
if self.variant == 'aiayn' or self.variant == 'xlm':
    tensor = _normalize(tensor, self.norm2)
tensor *= mask.unsqueeze(-1).type_as(tensor)
return tensor
```
This requires extra investigation when loading old models back, especially for code that uses TransformerEncoderLayer directly. With variant=None, none of the `if self.variant == ...` branches fire, so the post-attention and post-FFN layer norms from the old forward pass are silently skipped.
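Here is a minimal, self-contained sketch of the problem (not the actual ParlAI module; `ToyEncoderLayer` and its identity attention/ffn are stand-ins, and only the normalization logic mirrors the snippet above). Loading the same weights into an 'aiayn' layer and a variant=None layer produces different outputs:

```python
import torch
import torch.nn as nn

def _normalize(tensor, norm_layer):
    # Mirrors ParlAI's helper: just applies the given LayerNorm module.
    return norm_layer(tensor)

class ToyEncoderLayer(nn.Module):
    # Hypothetical reduction of TransformerEncoderLayer: attention and ffn
    # are replaced by identity so the variants differ only in normalization.
    def __init__(self, dim, variant=None):
        super().__init__()
        self.variant = variant
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tensor):
        residual = tensor
        if self.variant == 'prelayernorm':
            tensor = _normalize(tensor, self.norm1)
        tensor = residual + tensor  # stand-in for attention + dropout
        if self.variant == 'aiayn' or self.variant == 'xlm':
            tensor = _normalize(tensor, self.norm1)
        residual = tensor
        if self.variant == 'prelayernorm':
            tensor = _normalize(tensor, self.norm2)
        tensor = residual + tensor  # stand-in for ffn + dropout
        if self.variant == 'aiayn' or self.variant == 'xlm':
            tensor = _normalize(tensor, self.norm2)
        return tensor

torch.manual_seed(0)
x = torch.randn(2, 4, 8)
old_style = ToyEncoderLayer(8, variant='aiayn')  # legacy post-layernorm flow
new_default = ToyEncoderLayer(8, variant=None)   # current default: no norms run
new_default.load_state_dict(old_style.state_dict())
print(torch.allclose(old_style(x), new_default(x)))  # False: outputs diverge
```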
Suggestion: Change that default to 'xlm'.
Dexter, please use multiline code blocks when you quote code; the issues are difficult to read without them.
I don't think the default variant should be xlm; it should be aiayn if anything, but really we should just make None behave the same as aiayn.
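e.g. something like the following (a sketch only; `resolve_variant` and `VALID_VARIANTS` are illustrative names, not existing ParlAI APIs):

```python
VALID_VARIANTS = {'aiayn', 'xlm', 'prelayernorm'}

def resolve_variant(variant):
    """Map a missing variant to 'aiayn' so legacy configs keep the old
    post-layernorm forward path, and reject unknown values early."""
    if variant is None:
        return 'aiayn'
    if variant not in VALID_VARIANTS:
        raise ValueError(f'unknown transformer variant: {variant!r}')
    return variant
```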
This issue has not had activity in 30 days. Marking as stale.
closing as i'm not entirely sure what the issue is here; please reopen if this is breaking