transformer
Weight matrix sharing confusion
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation.
Hello! I read the paper recently and found that this part is mentioned in Section 3.4, Embeddings and Softmax, but your code seems to treat the output embedding layer and the pre-softmax linear layer as separate modules with separate weights. I want to ask what your consideration was for this part.
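For reference, this is roughly what I would expect the tying to look like in PyTorch. It is only a minimal sketch of the idea, not taken from your code; the class and method names (`TiedEmbeddingGenerator`, `embed_tokens`, `generate`) are hypothetical.

```python
import torch.nn as nn


class TiedEmbeddingGenerator(nn.Module):
    """Sketch of weight tying: the pre-softmax projection reuses the embedding matrix."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # weight: (vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)  # weight: (vocab_size, d_model)
        # Tie the two: both roles now share one and the same weight tensor.
        self.proj.weight = self.embed.weight

    def embed_tokens(self, tokens):
        # Section 3.4 also multiplies the embeddings by sqrt(d_model).
        return self.embed(tokens) * (self.embed.embedding_dim ** 0.5)

    def generate(self, hidden):
        # hidden: (batch, seq, d_model) -> logits over the vocabulary.
        return self.proj(hidden)
```

If I understand the paper correctly, the same matrix would also be shared between the encoder and decoder input embeddings, whereas in the current code each of the three is an independent parameter. Is that difference intentional?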