annotated-transformer
A question on the Embeddings class
import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)  # token look-up table
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
Does anyone know why the look-up table output is multiplied by the constant sqrt(d_model) before being returned?
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
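As a rough illustration (my own sketch, assuming PyTorch and toy sizes, not part of the paper), here is what that factor does to the magnitudes coming out of the look-up table:

import math
import torch
import torch.nn as nn

d_model, vocab = 512, 1000
emb = nn.Embedding(vocab, d_model)      # weights drawn from N(0, 1) by default

tokens = torch.randint(0, vocab, (4,))
raw = emb(tokens)                       # shape (4, d_model)
scaled = raw * math.sqrt(d_model)       # what Embeddings.forward returns

print(raw.norm(dim=-1).mean().item())     # roughly sqrt(d_model), about 22.6
print(scaled.norm(dim=-1).mean().item())  # roughly d_model, about 512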
Thanks. Does the paper explain mathematically why we multiply by this constant?
Not sure, but in my DL lectures they said it is "for numerical stability" :)
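For what it's worth, one intuition that is sometimes offered (an assumption on my part, not something the paper states) is that the scaling sets the size of the token embedding relative to the positional encoding that gets added right after it. A rough comparison, again assuming PyTorch and the sinusoidal encoding from the Annotated Transformer:

import math
import torch

d_model, max_len = 512, 100
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

emb = torch.randn(max_len, d_model)            # unscaled N(0, 1) embeddings
print(pe.abs().mean().item())                  # entries bounded by 1
print(emb.abs().mean().item())                 # about 0.8 before scaling
print((emb * math.sqrt(d_model)).abs().mean().item())  # about 18, much larger than the PE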