
Make Changes to `RobertaCustom` Layer

Open abheesht17 opened this issue 3 years ago • 2 comments

@mattdangerw, @chenmoneygithub -

The original RoBERTa implementation has four different dropout variables: https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta/model.py#L634-L637.

Our `RobertaCustom` layer, on the other hand, has only one: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/roberta.py#L86.

In order to incorporate all four dropout arguments, we will have to modify the `TransformerEncoder` layer.
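For reference, here is a minimal sketch of what an encoder block exposing the fairseq-style rates could look like. The class name and layer arrangement are illustrative only, not the actual keras-nlp implementation, and `pooler_dropout` is left out since it only affects the classification head rather than the encoder:

```python
from tensorflow import keras


class EncoderBlockWithSeparateDropouts(keras.layers.Layer):
    """Illustrative encoder block with fairseq-style dropout arguments.

    `dropout` is applied to sub-layer outputs, `attention_dropout` inside
    self-attention, and `activation_dropout` after the intermediate dense
    activation.
    """

    def __init__(
        self,
        intermediate_dim,
        num_heads,
        dropout=0.1,
        attention_dropout=0.1,
        activation_dropout=0.0,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.intermediate_dim = intermediate_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout

    def build(self, input_shape):
        hidden_dim = int(input_shape[-1])
        self._attention = keras.layers.MultiHeadAttention(
            num_heads=self.num_heads,
            key_dim=hidden_dim // self.num_heads,
            dropout=self.attention_dropout,
        )
        self._attention_norm = keras.layers.LayerNormalization(epsilon=1e-5)
        self._output_dropout = keras.layers.Dropout(self.dropout)
        self._intermediate_dense = keras.layers.Dense(
            self.intermediate_dim, activation="gelu"
        )
        self._activation_dropout = keras.layers.Dropout(self.activation_dropout)
        self._output_dense = keras.layers.Dense(hidden_dim)
        self._feedforward_norm = keras.layers.LayerNormalization(epsilon=1e-5)

    def call(self, inputs, training=None):
        # Self-attention sub-layer (post-LayerNorm, as in RoBERTa base).
        attention_output = self._attention(inputs, inputs, training=training)
        attention_output = self._output_dropout(attention_output, training=training)
        x = self._attention_norm(inputs + attention_output)

        # Feedforward sub-layer, with the extra `activation_dropout`.
        intermediate = self._intermediate_dense(x)
        intermediate = self._activation_dropout(intermediate, training=training)
        feedforward_output = self._output_dense(intermediate)
        feedforward_output = self._output_dropout(feedforward_output, training=training)
        return self._feedforward_norm(x + feedforward_output)
```

With RoBERTa base values this would be `EncoderBlockWithSeparateDropouts(3072, 12, dropout=0.1, attention_dropout=0.1)`, which matches the defaults above.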

Secondly, there are some differences between XLM-R Base and XLM-R XL: https://www.diffchecker.com/D0753p5i. For example, whether LayerNorm is applied to the sub-layer inputs or to the outputs (pre-LN vs. post-LN), whether there is a LayerNorm after the embedding layer, etc. We'll have to handle these differences.
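To make the LayerNorm placement difference concrete, here is a rough functional sketch; the `normalize_first` flag and helper name are hypothetical, not existing keras-nlp API:

```python
from tensorflow import keras


def apply_self_attention_block(x, attention, layer_norm, dropout, normalize_first):
    """One self-attention sub-layer with either pre- or post-LayerNorm."""
    residual = x
    if normalize_first:
        x = layer_norm(x)  # pre-LN: normalize the inputs to the sub-layer
    x = attention(x, x)
    x = dropout(x)
    x = residual + x
    if not normalize_first:
        x = layer_norm(x)  # post-LN: normalize the residual output
    return x


# Example wiring with RoBERTa base sizes (768 wide, 12 heads):
attention = keras.layers.MultiHeadAttention(num_heads=12, key_dim=64, dropout=0.1)
layer_norm = keras.layers.LayerNormalization(epsilon=1e-5)
dropout = keras.layers.Dropout(0.1)
x = keras.Input(shape=(None, 768))
post_ln_output = apply_self_attention_block(x, attention, layer_norm, dropout, False)
```

The LayerNorm after the embedding could be handled with a similar flag that decides whether an extra `LayerNormalization` layer is created after the token + position embedding sum.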

abheesht17 avatar Sep 18 '22 06:09 abheesht17

Just because they expose these parameters does not mean we need to as well. We do need our forward pass to be compatible with theirs, but we don't need a one-to-one API.

In the config file you linked, they set `activation_dropout = pooler_dropout = 0` and `dropout = attention_dropout = 0.1`. That actually seems fully compatible with what we offer in `TransformerEncoder`. Do they actually set `activation_dropout` to a non-zero value for some RoBERTa variant?
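For concreteness, assuming `TransformerEncoder`'s single `dropout` argument covers both fairseq's `dropout` and `attention_dropout` (with no dropout applied after the intermediate activation), the roberta_base config would map to something like the below; the sizes are RoBERTa base values:

```python
import keras_nlp

# roberta_base config: dropout = attention_dropout = 0.1,
# activation_dropout = pooler_dropout = 0.0. A single `dropout` rate is
# enough to express this, so no extra arguments would be needed.
encoder_layer = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=3072,
    num_heads=12,
    dropout=0.1,
    activation="gelu",
)
```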

mattdangerw avatar Sep 20 '22 00:09 mattdangerw

Agreed, we need to handle the differences between XLM-R base and XL (it's annoying that the architecture changes between what are supposed to be different sizes of the same model).

But maybe let's start with the easy version of the problem. If XLM-R extra large is the outlier, let's just work on all the smaller sizes first :)

mattdangerw avatar Sep 20 '22 00:09 mattdangerw