DiffuSeq
Why is no attention mask used?
In vanilla BERT, we need to pass an attention mask so that attention is not performed on padding positions. However, in this code repository, no attention mask is used. Is there a reason for this design?
Thanks!
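For reference, a minimal sketch of the usual pattern being referred to, using the Hugging Face `transformers` API (not part of this repository):

```python
# Standard BERT usage: the tokenizer produces an attention_mask that is 0
# at padded positions, and the model uses it to ignore [PAD] in attention.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["a short sentence", "a somewhat longer example sentence"],
    padding=True,            # pads the shorter sequence with [PAD]
    return_tensors="pt",
)

with torch.no_grad():
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # masks out padding positions
    )
```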
Currently, the [PAD] token is treated as a regular token, so the generated length can change during the generation process. This avoids the need for an additional length-prediction module, which is a difference from other NAR models. In other words, the model controls the length of the generated content by predicting [PAD] itself.
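A toy sketch of the idea (not DiffuSeq's actual code; the names and sizes below are hypothetical): the denoising transformer runs full self-attention with no mask, [PAD] is just another vocabulary item, and the effective output length is whatever remains after trimming the predicted [PAD] tokens.

```python
# Toy illustration: unmasked self-attention over a fixed-length sequence,
# with the output length implied by where the model predicts [PAD].
import torch
import torch.nn as nn

PAD_ID, VOCAB, DIM, MAX_LEN = 0, 1000, 64, 16   # hypothetical values

embed = nn.Embedding(VOCAB, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(DIM, VOCAB)

tokens = torch.randint(0, VOCAB, (1, MAX_LEN))   # stand-in for one denoising step
hidden = encoder(embed(tokens))                  # note: no src_key_padding_mask passed
logits = lm_head(hidden)
pred = logits.argmax(-1)[0]

# The effective generation length is decided by the model's own [PAD] predictions:
keep = pred != PAD_ID
print("generated length:", int(keep.sum()))
```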