graphium
Consider muP suggestions made in Appendix D of Tensor Programs V
Appendix D of the Tensor Programs V paper contains several practical suggestions for using muP that we would do well to consider, such as:
- fixing the dimension of each attention head as the model is scaled (D.4)
- using normal/Gaussian initialisation instead of uniform initialisation (D.5)
- using zeros to initialise attention query layers and "output" layers (those that map from a scaled to a non-scaled dimension) (D.2)
- tuning "input, output, and attention multipliers" (need to check exactly what this means) (D.7)
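A minimal sketch of what these suggestions might look like in practice, using numpy for illustration. All names (`width`, `head_dim`, `vocab`, the multiplier variables) and the concrete values are assumptions for the example, not taken from the paper or from any existing muP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# D.4: scale the model by adding heads, keeping each head's dimension fixed
head_dim = 64          # fixed as the model is scaled
width = 512            # the scaled ("widened") dimension
n_heads = width // head_dim

# D.5: normal/Gaussian initialisation (std = 1/sqrt(fan_in)) instead of uniform
W_hidden = rng.normal(0.0, width ** -0.5, size=(width, width))

# D.2: zero-initialise attention query layers and "output" layers
W_query = np.zeros((width, width))
vocab = 1000           # a non-scaled output dimension
W_out = np.zeros((vocab, width))   # maps scaled width -> fixed vocab size

# D.7: scalar input/output/attention multipliers, treated as tunable
# hyperparameters (the values here are placeholders, not recommendations)
input_mult = 1.0
output_mult = 1.0
attn_mult = 1.0 / head_dim   # muP uses 1/d attention scaling rather than 1/sqrt(d)
```

The zero-initialised matrices mean the query and output layers contribute nothing at step zero and are learned from scratch, which is what makes the muP learning-rate transfer argument go through for those layers.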