annotated-transformer
An annotated implementation of the Transformer paper.
Not sure what is wrong, any suggestions? I get the following warnings:

```
/usr/local/lib/python3.7/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:20: UserWarning: nn.init.xavier_uniform is now deprecated in favor of...
```
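Both messages point at renamed PyTorch APIs rather than a real bug. A hedged sketch of the two substitutions, assuming the warnings come from the notebook's `LabelSmoothing` criterion and the initialization loop in `make_model`:

```python
import torch.nn as nn

def build_criterion_and_init(model: nn.Module) -> nn.KLDivLoss:
    # old: nn.KLDivLoss(size_average=False); the warning suggests reduction="sum"
    criterion = nn.KLDivLoss(reduction="sum")
    # old: nn.init.xavier_uniform(p); renamed to the in-place nn.init.xavier_uniform_
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return criterion
```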
I ran the code on Google Colab. When _building the German vocabulary_ here:

```
if is_interactive_notebook():
    # global variables used later in the script
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = ...
```
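Since the error itself is cut off, one guess (purely an assumption): the spaCy language models that `load_tokenizers` expects may not be installed on a fresh Colab runtime. A minimal check:

```python
import spacy

# Download the models the notebook's tokenizers rely on, if they are missing.
for name in ("de_core_news_sm", "en_core_web_sm"):
    try:
        spacy.load(name)
    except OSError:
        spacy.cli.download(name)  # programmatic equivalent of `python -m spacy download`
```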
First of all: thank you for this work, it is really easy to follow along with this notebook. My question is the following: in the `MultiHeadedAttention` class, you instantiate 4 affine...
```
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        ...
```
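Since the snippet is cut off above, here is a sketch of how those 4 linear layers are used in the notebook's forward pass: the first three project the query, key, and value, and the fourth (`self.linears[-1]`) is the output projection applied after the heads are concatenated. `clones` and `attention` are the notebook's helpers, reproduced so the sketch is self-contained:

```python
import copy
import math
import torch
import torch.nn as nn

def clones(module, N):
    "Produce N identical layers (helper from the notebook)."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

def attention(query, key, value, mask=None, dropout=None):
    "Scaled dot-product attention, as defined in the notebook."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # 4 linears: W_Q, W_K, W_V for the projections, plus W_O for the output
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # same mask applied to all h heads
        nbatches = query.size(0)
        # 1) The first 3 of the 4 linears project query, key, and value.
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        # 2) Apply attention on all the projected vectors in the batch.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # 3) Concatenate heads and apply the 4th linear, the output projection W_O.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
```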
```
if False:
    model.src_embed[0].lut.weight = model.tgt_embeddings[0].lut.weight
    model.generator.lut.weight = model.tgt_embed[0].lut.weight
```

Hi, I can't find `tgt_embeddings` in your code. Maybe it should be `model.src_embed[0].lut.weight = model.tgt_embed[0].lut.weight`. And if the embeddings are shared, should the...
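A hedged sketch of what the corrected tying would probably look like, assuming (per the notebook) the target embedding lives at `model.tgt_embed[0].lut` and the `Generator`'s output layer is named `proj` rather than `lut`:

```python
# Assumption: model was built by make_model, so Generator defines self.proj.
model.src_embed[0].lut.weight = model.tgt_embed[0].lut.weight  # tgt_embeddings -> tgt_embed
model.generator.proj.weight = model.tgt_embed[0].lut.weight    # generator has proj, not lut
```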
Thanks for the great resource!
According to what you wrote: _“That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout [(cite)](http://jmlr.org/papers/v15/srivastava14a.html)...
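For comparison, the notebook's `SublayerConnection` applies the norm *before* the sub-layer, i.e. it computes $x + \mathrm{Dropout}(\mathrm{Sublayer}(\mathrm{LayerNorm}(x)))$ rather than the post-norm formula quoted above. A sketch of that class (using `nn.LayerNorm` in place of the notebook's custom `LayerNorm` so it runs standalone):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """
    Residual connection followed by a layer norm (notebook version).
    Note: for code simplicity the norm is applied first, not last, so this
    is pre-norm rather than the post-norm formula quoted from the paper.
    """
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)  # the notebook uses its own LayerNorm class
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))
```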
In the `MultiHeadedAttention` class the line `self.linears = clones(nn.Linear(d_model, d_model), 4)` occurs, but shouldn't it be a 3 instead of a 4: `self.linears = clones(nn.Linear(d_model, d_model), 3)`? Am I correct?
The original [paper](https://arxiv.org/pdf/1607.06450.pdf) computes a biased estimate of the sample standard deviation. However, by default `torch.Tensor.std()` uses an unbiased estimate ([Ref](https://pytorch.org/docs/1.11/generated/torch.Tensor.std.html?highlight=torch%20std#torch.Tensor.std)). Therefore, it is necessary to use `torch.Tensor.std(-1, unbiased=False)`. Moreover, the class...
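A sketch of the notebook's `LayerNorm` with the suggested fix applied; only the `unbiased=False` argument changes:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    "Notebook's LayerNorm with the biased-estimator fix suggested above."
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        # unbiased=False gives the biased estimate used in the LayerNorm paper
        std = x.std(-1, keepdim=True, unbiased=False)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
```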
Hi, great notebook! Just wanted to mention that there is no need to pass the `generator` to the constructor of the `EncoderDecoder` class. It makes it a bit confusing as...
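To illustrate the point: in the notebook's `EncoderDecoder`, `generator` is stored but never used by `forward`; it is only called later, during loss computation. A trimmed sketch:

```python
import torch.nn as nn

class EncoderDecoder(nn.Module):
    "Standard encoder-decoder architecture, as in the notebook."
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator  # stored here but only referenced outside forward

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
```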