MTR
MultiheadAttention Usage in Decoder
I really appreciate your work, but I ran into a question while reviewing the code. In the figure in the paper, the output of the first multi-head attention in the decoder is fed into the second multi-head attention. However, I couldn't find where this is implemented in the code.
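To make the question concrete, here is a minimal PyTorch sketch of the data flow I expected from the figure. It is not taken from the MTR code: the class name `TwoStageAttentionLayer`, the dimensions, and the assumption that the first block is self-attention over the queries and the second is cross-attention over encoder features are all hypothetical.

```python
import torch
from torch import nn


class TwoStageAttentionLayer(nn.Module):
    """Hypothetical decoder layer for illustration; not the MTR implementation."""

    def __init__(self, d_model: int = 64, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, queries: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # First attention (assumed): the queries attend to each other.
        self_out, _ = self.self_attn(queries, queries, queries)
        x = self.norm1(queries + self_out)
        # Second attention (assumed): the output of the first attention is
        # used as the query; encoder features serve as keys and values.
        cross_out, _ = self.cross_attn(x, memory, memory)
        return self.norm2(x + cross_out)


layer = TwoStageAttentionLayer()
queries = torch.randn(2, 6, 64)   # (batch, num_queries, d_model)
memory = torch.randn(2, 10, 64)   # (batch, num_encoder_tokens, d_model)
out = layer(queries, memory)
print(out.shape)
```

The part I could not locate in the repository is the step where `x`, the result of the first attention, becomes the query of the second one.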