graphium
Consider muP suggestions made in Appendix D of Tensor Programs V
Appendix D of the Tensor Programs V paper contains several practical suggestions for using muP that we would do well to consider, such as:
- fixing the dimension of each attention head as the model is scaled (D.4)
- using normal/Gaussian initialisation instead of uniform initialisation (D.5)
- using zeros to initialise attention query layers and "output" layers (those that map from a scaled to a non-scaled dimension) (D.2)
- tuning "input, output, and attention multipliers" (need to check exactly what this means) (D.7)
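A minimal sketch of what these suggestions might look like in practice, using numpy for illustration. All names (`width`, `head_dim`, `vocab`, the multiplier variables) and the concrete values are assumptions for the example, not taken from the paper or from any existing muP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# D.4: scale the model by adding heads, keeping each head's dimension fixed
head_dim = 64          # fixed as the model is scaled
width = 512            # the scaled ("widened") dimension
n_heads = width // head_dim

# D.5: normal/Gaussian initialisation (std = 1/sqrt(fan_in)) instead of uniform
W_hidden = rng.normal(0.0, width ** -0.5, size=(width, width))

# D.2: zero-initialise attention query layers and "output" layers
W_query = np.zeros((width, width))
vocab = 1000           # a non-scaled output dimension
W_out = np.zeros((vocab, width))   # maps scaled width -> fixed vocab size

# D.7: scalar input/output/attention multipliers, treated as tunable
# hyperparameters (the values here are placeholders, not recommendations)
input_mult = 1.0
output_mult = 1.0
attn_mult = 1.0 / head_dim   # muP uses 1/d attention scaling rather than 1/sqrt(d)
```

The zero-initialised matrices mean the query and output layers contribute nothing at step zero and are learned from scratch, which is what makes the muP learning-rate transfer argument go through for those layers.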