DiffuSeq
Why is no attention mask used?
In vanilla BERT, we need to pass an attention mask so that attention is not performed on padding positions. However, in this code repository, no attention mask is used. Is there a reason for this design?
Thanks!
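For reference, a minimal sketch of the usual pattern being referred to, using the Hugging Face `transformers` API (not part of this repository):

```python
# Standard BERT usage: the tokenizer produces an attention_mask that is 0
# at padded positions, and the model uses it to ignore [PAD] in attention.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["a short sentence", "a somewhat longer example sentence"],
    padding=True,            # pads the shorter sequence with [PAD]
    return_tensors="pt",
)

with torch.no_grad():
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # masks out padding positions
    )
```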
Currently, the [PAD] token is treated as a regular token, so the generated length can change during the generation process. This avoids the need for an additional length-prediction module, which is a difference from other NAR models. In other words, the model controls the length of the generated content by predicting [PAD] itself.
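A toy sketch of the idea (not DiffuSeq's actual code; the names and sizes below are hypothetical): the denoising transformer runs full self-attention with no mask, [PAD] is just another vocabulary item, and the effective output length is whatever remains after trimming the predicted [PAD] tokens.

```python
# Toy illustration: unmasked self-attention over a fixed-length sequence,
# with the output length implied by where the model predicts [PAD].
import torch
import torch.nn as nn

PAD_ID, VOCAB, DIM, MAX_LEN = 0, 1000, 64, 16   # hypothetical values

embed = nn.Embedding(VOCAB, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(DIM, VOCAB)

tokens = torch.randint(0, VOCAB, (1, MAX_LEN))   # stand-in for one denoising step
hidden = encoder(embed(tokens))                  # note: no src_key_padding_mask passed
logits = lm_head(hidden)
pred = logits.argmax(-1)[0]

# The effective generation length is decided by the model's own [PAD] predictions:
keep = pred != PAD_ID
print("generated length:", int(keep.sum()))
```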