CBART
About full mask
If the full mask is all ones, the decoder can be seen as BERT, right?
From the viewpoint of model structure, I think yes. But the training paradigms of BERT and the BART decoder differ: BERT is pre-trained with next-sentence prediction and masked-token prediction, while the BART decoder attends to the BART encoder via cross-attention and is pre-trained autoregressively.
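For concreteness, here is a minimal sketch of what the two attention patterns look like, assuming the usual 0/1 mask convention (1 = this position may be attended to); the tensor names are just for illustration:

```python
import torch

seq_len = 5

# Causal mask used by the BART decoder during autoregressive training:
# position i may only attend to positions 0..i (lower-triangular ones).
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# "Full" mask of all ones: every position attends to every position,
# which is the bidirectional attention pattern BERT uses.
full_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(full_mask)
```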
Are any lexical constraints set during training, such as a T5-style prefix? Is it correct that the keywords are included in the training data? You don't specify a keyword prefix like {c1, c2, c3, c4}?
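Just to illustrate what I mean by a keyword prefix (the formatting below is hypothetical and not taken from the CBART code):

```python
keywords = ["c1", "c2", "c3", "c4"]
target = "a training sentence that happens to contain c1, c2, c3 and c4"

# T5-style prefixed source, i.e. an explicit lexical constraint:
prefixed_source = "keywords: " + " ".join(keywords)

# vs. no explicit prefix: the keywords only appear inside the target
# sentence that the model is trained to generate.
unprefixed_source = target
```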