
Parameter Tying

Open arivero opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.

A lot of models, including GPT, use the same weight matrix for the input embedding and, transposed, for the "unembedding" that produces logits. This means the same tensor must live in two different layers, or a single layer must be able to handle two different kinds of inputs.

Describe the solution you'd like

I think the ideal solution is to be able to signal, when creating a layer, that some of its weights should be taken from another layer. This would also cover the slightly more general case where you create a tensor yourself and provide it as a weight to both layers.
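
For concreteness, here is a minimal sketch of the kind of "hard" tying I have in mind, in plain Keras. The `TiedOutputProjection` layer and the shapes are just illustrative, not an existing API:

```python
import tensorflow as tf
from tensorflow import keras

# Hypothetical sketch: the embedding owns the (vocab_size, hidden_dim)
# matrix, and the "unembedding" reuses it transposed at call time, so
# there is only one variable to train.
vocab_size, hidden_dim = 50257, 768
embedding = keras.layers.Embedding(vocab_size, hidden_dim)


class TiedOutputProjection(keras.layers.Layer):
    """Maps hidden states to vocab logits with the tied embedding matrix."""

    def __init__(self, tied_embedding, **kwargs):
        super().__init__(**kwargs)
        self.tied_embedding = tied_embedding

    def call(self, hidden_states):
        # `embeddings` exists once the tied embedding layer has been built.
        return tf.matmul(
            hidden_states, self.tied_embedding.embeddings, transpose_b=True
        )


unembedding = TiedOutputProjection(embedding)
hidden = embedding(tf.constant([[1, 2, 3]]))  # (1, 3, hidden_dim)
logits = unembedding(hidden)                  # (1, 3, vocab_size)
```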

Describe alternatives you've considered

In principle it could be possible to do a "soft tying" of the weights in two layers by adding a loss that measures the difference between them to the regularization losses of one or both layers, or of the model. It is unclear whether this works in practice.
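
A rough, untested sketch of what I mean by soft tying, assuming both layers are already built (the `SoftTyingPenalty` name and the penalty weight are just illustrative):

```python
import tensorflow as tf
from tensorflow import keras


class SoftTyingPenalty(keras.layers.Layer):
    """Adds an L2 penalty between an embedding matrix and a transposed
    output kernel, nudging the two towards each other during training."""

    def __init__(self, embedding, output_dense, weight=1e-3, **kwargs):
        super().__init__(**kwargs)
        self.embedding = embedding        # keras.layers.Embedding
        self.output_dense = output_dense  # keras.layers.Dense(vocab_size)
        self.weight = weight

    def call(self, inputs):
        # embeddings: (vocab_size, hidden_dim); kernel: (hidden_dim, vocab_size).
        diff = self.embedding.embeddings - tf.transpose(self.output_dense.kernel)
        self.add_loss(self.weight * tf.reduce_sum(tf.square(diff)))
        return inputs  # pass inputs through unchanged
```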

Alternatively, it could be possible to play with the initialization of the two layers to make sure they receive the same tensor; this seems feasible but hacky if it is not offered as an official option.

arivero avatar Mar 20 '23 03:03 arivero

In backbone models, the token embedding is exposed as a property: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gpt2/gpt2_backbone.py#L193.

In GPT2CausalLM, we take this layer, and transpose it to get the "unembedding" layer: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gpt2/gpt2_causal_lm.py#L249. Is this what you are suggesting?
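
Roughly, paraphrasing the linked code rather than quoting it exactly:

```python
import tensorflow as tf
import keras_nlp

backbone = keras_nlp.models.GPT2Backbone.from_preset("gpt2_base_en")
inputs = {
    "token_ids": tf.constant([[464, 2635, 640, 318]]),
    "padding_mask": tf.constant([[1, 1, 1, 1]]),
}
hidden_states = backbone(inputs)  # (1, 4, hidden_dim)

# Reuse the token embedding matrix, transposed, as the output projection.
logits = tf.matmul(
    hidden_states,
    backbone.token_embedding.embeddings,
    transpose_b=True,
)  # (1, 4, vocab_size)
```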

abheesht17 avatar Mar 20 '23 04:03 abheesht17

Indeed, GPT is the main use case, and I wonder whether the "unembedding" operation could be "upgraded" to a proper layer, to avoid a proliferation of hacks. Perhaps for LMs that is all we need, even if parameter tying is a more general concept.

A main issue is experimenting with fine-tuning, I guess. While most of the time there are no frozen layers, sometimes it could be interesting to freeze only the attention layers, or only the embedding. Also, during training it could be interesting to experiment with different values coming from the embedding versus the unembedding. Finally, a fixed standard or recommended approach is valuable if one is going to experiment with further modifications of the embedding (for example soft prompts, see #889).
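
For instance, something like this for freezing just the token embedding during fine-tuning (untested; note that with hard tying, freezing the embedding would also freeze the unembedding, since they are the same variable):

```python
import keras_nlp

backbone = keras_nlp.models.GPT2Backbone.from_preset("gpt2_base_en")
# Freeze only the shared token embedding; the rest stays trainable.
backbone.token_embedding.trainable = False
print(len(backbone.trainable_weights), len(backbone.non_trainable_weights))
```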

arivero avatar Mar 20 '23 09:03 arivero

Gotcha! Good point

abheesht17 avatar Mar 21 '23 19:03 abheesht17

Support for this at a low level is now available via https://keras.io/api/keras_nlp/modeling_layers/reversible_embedding/

We could consider opening up a parameter for this in high-level tasks for specific architectures, e.g. a way to fine-tune GPT2 with the output projection weights untied from the input token embedding (but initialized from the same checkpoint values). I am not sure how big of a use case there is for this, though.
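
For reference, usage of the new layer looks roughly like this; with `tie_weights=False` the reverse projection gets its own variable, which is what the untied fine-tuning experiment above would need:

```python
import tensorflow as tf
import keras_nlp

embedding = keras_nlp.layers.ReversibleEmbedding(
    input_dim=50257,   # vocabulary size
    output_dim=768,    # hidden dimension
    tie_weights=True,  # set False to untie the reverse projection
)
token_ids = tf.constant([[1, 2, 3]])
hidden = embedding(token_ids)             # (1, 3, 768)
logits = embedding(hidden, reverse=True)  # (1, 3, 50257)
```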

mattdangerw avatar Oct 18 '23 22:10 mattdangerw