transformer
Questions about attention
When I checked the attention visualization, I found that the model actually puts some attention weight on the padding positions. I think this is because (1) the embedding uses dropout, and (2) after the first block, K and Q no longer carry the information about which positions are "padding" (the zeros are changed by the feed-forward layer), so the mask function didn't work.
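To make the question concrete, this is the kind of key-padding masking I would expect to zero out those positions: build the mask from the token ids (not from the embeddings, which dropout and the feed-forward layer will change) and apply it to the attention scores before the softmax in every layer. This is only a minimal PyTorch-style sketch, not the actual code; names like `pad_id`, `tokens`, and `scores` are illustrative.

```python
import torch
import torch.nn.functional as F

# Build the padding mask from token ids, not from embedding values,
# so later layers cannot "erase" the padding information.
pad_id = 0                                          # assumed padding token id
tokens = torch.tensor([[5, 7, 2, pad_id, pad_id]])  # (batch, seq_len)
pad_mask = tokens.eq(pad_id)                        # True where the key position is padding

d_k = 64
q = torch.randn(1, 5, d_k)                          # queries (batch, seq_len, d_k)
k = torch.randn(1, 5, d_k)                          # keys    (batch, seq_len, d_k)
v = torch.randn(1, 5, d_k)                          # values  (batch, seq_len, d_k)

# Scaled dot-product attention with the key-padding mask applied to the logits.
scores = q @ k.transpose(-2, -1) / d_k ** 0.5       # (batch, seq_len, seq_len)
scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
attn = F.softmax(scores, dim=-1)                    # padding columns get ~0 weight
out = attn @ v
```

If the mask is instead derived from which embedding vectors are zero, it stops working after the first block for exactly the reasons above, since dropout and the feed-forward layer turn those zeros into nonzero values.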