BiLLM
Interaction of padding and bidirectional mask
Hi,
Thanks for sharing this very interesting work. I have a question about how the bidirectional attention mask is implemented here.
Based on this implementation, it seems that even the padding tokens in a batch get unmasked, whereas they should remain masked under both unidirectional and bidirectional attention. Is my understanding correct?
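
For concreteness, here is a rough sketch (not taken from this repo) of how I would expect padding to be handled when switching to a bidirectional mask. The function name and the Hugging Face-style `attention_mask` convention (1 = real token, 0 = padding) are just my assumptions:

```python
import torch

def build_bidirectional_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Sketch: bidirectional attention that still respects padding.

    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding.
    Returns an additive mask of shape (batch, 1, seq_len, seq_len) where every
    query can attend to every non-padding key, but padded keys stay masked out.
    """
    batch, seq_len = attention_mask.shape
    # Broadcast the padding mask over the query dimension: (B, 1, 1, S).
    key_mask = attention_mask[:, None, None, :].to(torch.bool)
    additive = torch.zeros(batch, 1, seq_len, seq_len)
    # Padded key positions get a large negative value so softmax ignores them.
    additive = additive.masked_fill(~key_mask, torch.finfo(additive.dtype).min)
    return additive


if __name__ == "__main__":
    # Batch of 2, seq_len 4; the second sequence has one padding token.
    attn = torch.tensor([[1, 1, 1, 1],
                         [1, 1, 1, 0]])
    mask = build_bidirectional_mask(attn)
    print(mask[1, 0])  # last key column stays masked for the padded sequence
```

In other words, I would expect the padding mask to be applied on top of the all-ones bidirectional pattern, rather than the whole mask being unmasked. Am I misreading the code?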