
Consider muP suggestions made in Appendix D of Tensor Programs V

Open callumm-graphcore opened this issue 2 years ago • 0 comments

Appendix D of the Tensor Programs V paper contains several practical suggestions for using muP that we should consider, such as:

  • fixing the dimension of each attention head as the model is scaled (D.4)
  • using normal/Gaussian initialisation instead of uniform initialisation (D.5)
  • zero-initialising attention query layers and "output" layers (those that map from a scaled to a non-scaled dimension) (D.2)
  • tuning "input, output, and attention multipliers" (need to check exactly what this means) (D.7)
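
To make the first three suggestions concrete, here is a rough, library-free sketch of what they might look like for a single attention block. It is not graphium code: the function names, the parameter dictionary layout, the default head dimension of 64, and the 1/sqrt(fan_in) standard deviation for hidden weights are all assumptions made for illustration. The multipliers from D.7 are left out because their exact meaning still needs checking.

```python
import math
import random


def num_heads_for_width(d_model: int, d_head: int = 64) -> int:
    """D.4: keep the per-head dimension fixed and let the head count grow with width."""
    if d_model % d_head != 0:
        raise ValueError("d_model must be a multiple of d_head")
    return d_model // d_head


def gaussian_init(fan_out: int, fan_in: int, std: float, rng: random.Random):
    """D.5: draw weights from N(0, std^2) instead of a uniform distribution."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def zero_init(fan_out: int, fan_in: int):
    """D.2: start query and "output" (readout) weights at exactly zero."""
    return [[0.0] * fan_in for _ in range(fan_out)]


def init_block(d_model: int, d_out: int, d_head: int = 64, seed: int = 0) -> dict:
    """Hypothetical helper that bundles the three suggestions for one block."""
    rng = random.Random(seed)
    # Assumption: hidden weights use a standard deviation of 1/sqrt(fan_in).
    hidden_std = 1.0 / math.sqrt(d_model)
    return {
        "n_heads": num_heads_for_width(d_model, d_head),
        "query": zero_init(d_model, d_model),
        "key": gaussian_init(d_model, d_model, hidden_std, rng),
        "value": gaussian_init(d_model, d_model, hidden_std, rng),
        # Maps from the scaled width (d_model) to a fixed-size output (d_out).
        "readout": zero_init(d_out, d_model),
    }
```

For example, `init_block(d_model=128, d_out=10, d_head=32)["n_heads"]` is 4, and doubling `d_model` to 256 doubles the head count to 8 while each head stays 32-dimensional.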

callumm-graphcore, Feb 01 '23 13:02