RetroMAE Question about enhanced decoding

Open mtybadger opened this issue 1 year ago • 9 comments

Hi staoxiao,

I wanted to ask more about how the enhanced decoding works. It looks like it generates 256 random candidate attention masks and then picks one at random from that list for each token. The label is just the original string, with [CLS] and [SEP] masked out for loss purposes.

It looks like the input that the decoder layer gets is the CLS output of the encoder concatenated with the rest of the original string. What I'm confused about is that some of the masks the decoder tokens get have a 0 in the first position. Does this mean that some tokens in the decoder layer never get to see the CLS token at all? If so, surely they can't use the information in it to reconstruct the original text?
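To make the concern concrete, here is a minimal sketch of the behaviour as I understand it. It is not the repository's code: the function names, `pool_size`, `keep_prob`, and the `force_first_visible` switch are all hypothetical, and the uniform random masks are an assumption made only for illustration. It builds a pool of random visibility rows, lets each token draw one row, and then measures how many tokens end up with a 0 in the first (CLS) position.

```python
import numpy as np


def build_position_masks(seq_len, pool_size=256, keep_prob=0.5,
                         force_first_visible=False, seed=0):
    """Return a (seq_len, seq_len) 0/1 matrix; row i is token i's mask.

    1 means "may attend to this position", 0 means "blocked".
    All parameters are illustrative assumptions, not RetroMAE settings.
    """
    rng = np.random.default_rng(seed)
    # Pool of candidate masks, one row per candidate.
    pool = (rng.random((pool_size, seq_len)) < keep_prob).astype(np.int64)
    # Each token picks one candidate from the pool at random.
    choice = rng.integers(0, pool_size, size=seq_len)
    mask = pool[choice].copy()
    if force_first_visible:
        # Hypothetical fix: always expose position 0 (the CLS slot).
        mask[:, 0] = 1
    return mask


def fraction_blind_to_first(mask):
    """Fraction of tokens whose mask has a 0 in the first position."""
    return float(1.0 - mask[:, 0].mean())


if __name__ == "__main__":
    plain = build_position_masks(seq_len=32)
    forced = build_position_masks(seq_len=32, force_first_visible=True)
    print("blind to CLS (plain): ", fraction_blind_to_first(plain))
    print("blind to CLS (forced):", fraction_blind_to_first(forced))
```

With purely random masks like these, some share of tokens is blind to the first position unless that column is forced to 1, which is exactly the situation I'm asking about.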

I hope my question makes sense, and I appreciate that there may be something weird going on here with the attention that I don't understand. I don't see these naked QKV attention layers very often.

Spruce

Oct 19 '23 00:10 mtybadger
