annotated-transformer
A question on the Embeddings class
import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)  # token look-up table
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
Does anyone know why the look-up table output is multiplied by the constant sqrt(d_model) before being returned?
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
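As a rough illustration (my own sketch, assuming PyTorch and toy sizes, not part of the paper), here is what that factor does to the magnitudes coming out of the look-up table:

import math
import torch
import torch.nn as nn

d_model, vocab = 512, 1000
emb = nn.Embedding(vocab, d_model)      # weights drawn from N(0, 1) by default

tokens = torch.randint(0, vocab, (4,))
raw = emb(tokens)                       # shape (4, d_model)
scaled = raw * math.sqrt(d_model)       # what Embeddings.forward returns

print(raw.norm(dim=-1).mean().item())     # roughly sqrt(d_model), about 22.6
print(scaled.norm(dim=-1).mean().item())  # roughly d_model, about 512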
Thanks. Does the paper explain mathematically why we multiply by this constant?
Not sure, but in my DL lectures they said it is "for numerical stability" :)
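For what it's worth, one intuition that is sometimes offered (an assumption on my part, not something the paper states) is that the scaling sets the size of the token embedding relative to the positional encoding that gets added right after it. A rough comparison, again assuming PyTorch and the sinusoidal encoding from the Annotated Transformer:

import math
import torch

d_model, max_len = 512, 100
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

emb = torch.randn(max_len, d_model)            # unscaled N(0, 1) embeddings
print(pe.abs().mean().item())                  # entries bounded by 1
print(emb.abs().mean().item())                 # about 0.8 before scaling
print((emb * math.sqrt(d_model)).abs().mean().item())  # about 18, much larger than the PE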