attention-is-all-you-need-pytorch
Linear layers use Kaiming initialization
Thank you for the code! By default, PyTorch linear layers use Kaiming uniform initialization, which is meant for layers followed by a ReLU activation with inputs roughly uniform in [-1, +1]. However, this default is also used at the output projections of the attention and feed-forward submodules. With this initialization, training loss does not decrease on the Ubuntu conversation corpus.

I found that switching to Xavier (a smaller init) led to convergence, while even smaller values (Xavier * 1/100) led to faster convergence. This is not surprising, considering each submodule uses skip connections (x + f(x)): smaller outputs keep the variance of a Transformer layer's output close to that of its input, which is what these initialization methods aim to achieve. LayerNorm corrects any change in activation variance, but it does not stop layer outputs from starting out increasingly non-linear early in training, which can complicate gradients.

Also, I don't believe Attention Is All You Need mentions initialization, but I could be wrong. Thanks.
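For reference, here is a minimal sketch of the kind of re-initialization I mean; the helper name and the `scale` argument are just for illustration, not part of this repo's code:

```python
import torch.nn as nn

def reinit_linear_xavier(model: nn.Module, scale: float = 1.0) -> None:
    """Re-initialize every nn.Linear with Xavier uniform weights,
    optionally scaled down (e.g. scale=0.01 for the Xavier * 1/100 case)."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            with torch.no_grad():
                module.weight.mul_(scale)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

# usage (assuming `transformer` is the model instance):
# import torch
# reinit_linear_xavier(transformer, scale=0.01)
```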