Make Changes to `RobertaCustom` Layer
@mattdangerw, @chenmoneygithub -
The original RoBERTa implementation has four separate dropout rates (dropout, attention_dropout, activation_dropout, and pooler_dropout): https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta/model.py#L634-L637.
Our RobertaCustom layer, on the other hand, exposes only one: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/roberta.py#L86.
To incorporate all four dropout arguments, we would have to modify the TransformerEncoder layer.
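For reference, here's a rough sketch (plain Keras stand-ins, not our actual `TransformerEncoder`) of where each of the four rates would sit in a post-norm encoder block. The dimensions and rate values below are just illustrative, and pooler_dropout wouldn't live in the encoder at all:

```python
from tensorflow import keras

hidden_dim, intermediate_dim, num_heads = 768, 3072, 12  # illustrative sizes
dropout = 0.1             # applied to each sublayer's output before the residual add
attention_dropout = 0.1   # applied to the attention probabilities
activation_dropout = 0.0  # applied after the feed-forward activation
# pooler_dropout would apply inside the classification head, not the encoder block.

inputs = keras.Input(shape=(None, hidden_dim))

# Self-attention sublayer: attention_dropout inside MHA, dropout on its output.
attn_out = keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=hidden_dim // num_heads, dropout=attention_dropout
)(inputs, inputs)
attn_out = keras.layers.Dropout(dropout)(attn_out)
x = keras.layers.LayerNormalization()(inputs + attn_out)

# Feed-forward sublayer: activation_dropout after the activation, dropout on the output.
ff = keras.layers.Dense(intermediate_dim, activation="gelu")(x)
ff = keras.layers.Dropout(activation_dropout)(ff)
ff = keras.layers.Dense(hidden_dim)(ff)
ff = keras.layers.Dropout(dropout)(ff)
outputs = keras.layers.LayerNormalization()(x + ff)

encoder_block = keras.Model(inputs, outputs)
```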
Secondly, there are some differences between XLM-R Base and XLM-R XL: https://www.diffchecker.com/D0753p5i. For example, whether LayerNorm is applied to the sublayer inputs or to the sublayer outputs, whether there is a LayerNorm after the embedding layer, etc. We'll have to take care of these differences.
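To make the LayerNorm placement difference concrete, the two block layouts look roughly like this (stand-in `Dense` layers instead of the real sublayers, purely to keep the sketch runnable; if I'm reading the diff right, Base uses the post-norm layout and XL the pre-norm one):

```python
from tensorflow import keras

hidden_dim = 768
attention = keras.layers.Dense(hidden_dim)     # stand-in for the self-attention sublayer
feed_forward = keras.layers.Dense(hidden_dim)  # stand-in for the feed-forward sublayer
layer_norm = keras.layers.LayerNormalization()

def post_norm_block(x):
    # Base-style: LayerNorm applied to each sublayer's output, after the residual add.
    x = layer_norm(x + attention(x))
    x = layer_norm(x + feed_forward(x))
    return x

def pre_norm_block(x):
    # XL-style: LayerNorm applied to each sublayer's input; the residual stays un-normalized.
    x = x + attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x))
    return x
```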
Just because they expose these parameters does not mean we need to as well. We do need compatibility between our forward pass and theirs, but we don't need a one-to-one API.
In the config file you linked, they set activation_dropout = pooler_dropout = 0 and dropout = attention_dropout = 0.1. That actually seems fully compatible with what we offer in TransformerEncoder. Do they actually set activation dropout to a non-zero value for some RoBERTa variant?
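Concretely, the mapping I have in mind is just the following (a sketch; as far as I can tell our single dropout rate covers both the attention probabilities and the sublayer outputs, and the sizes are only Base-scale examples):

```python
import keras_nlp

# With activation_dropout = pooler_dropout = 0, the fairseq config collapses onto
# the one rate we already expose in TransformerEncoder.
encoder = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=3072,
    num_heads=12,
    dropout=0.1,  # stands in for both fairseq's dropout and attention_dropout
    activation="gelu",
)
```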
Agreed, we need to handle the differences between XLM-R Base and XL (it's annoying that their architecture changes between what are supposed to be different sizes of the same model).
But maybe let's start with the easy version of the problem. If XLM-R XL is the outlier, let's just work on all the smaller sizes first :)