Model differences between PLATO-2 and PLATO

Open kiseliu opened this issue 3 years ago • 1 comments

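Apart from the differences mentioned in the paper (pre-norm vs. post-norm, and the tokenizer),

I compared the network structure of PLATO with that of PLATO-2 (the stage 2.1 PLATO model) and found a few more subtle differences: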

1. When predicting the latent variable, PLATO-1 passes the final hidden state of the mask token through post_network. In PLATO-2, as I understand it, the recognition_fc layer is there to take the mask token's final hidden state, and post_network is replaced by (latent_embedding, recognition_bias).

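2. When computing the NLL loss (generation network), PLATO-1 puts no separate classifier on top of the final hidden states of the response tokens; the output projection shares parameters with the word embedding instead. In PLATO-2, the final hidden states of the response tokens first go through a mask_lm_trans_fc layer and a layer norm, and the projection that shares parameters with the word embedding also gains a bias, mask_lm_out_fc.b_0.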

3. The BOW loss is changed in the same way as the NLL loss: there is an extra bow_trans_fc layer and a layer norm, plus a bias, bow_out_fc.b_0.
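To make the three points above concrete, here is a minimal NumPy sketch of how I read the two variants. It is not code from either repository. The toy sizes, the helper names (`dense`, `layer_norm`, `plato1_*`, `plato2_*`), the treatment of post_network and recognition_fc as plain fully connected layers, and the omission of any activation function are all my assumptions. Only the layer names in the comments come from the checkpoints.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, NUM_LATENT = 8, 20, 5  # toy sizes chosen for illustration only


def dense(in_dim, out_dim):
    """Return (weight, bias) for a toy fully connected layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def layer_norm(x, eps=1e-5):
    """Layer norm over the last axis, without learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


word_emb = rng.normal(scale=0.1, size=(VOCAB, HIDDEN))         # tied with the input word embedding
latent_emb = rng.normal(scale=0.1, size=(NUM_LATENT, HIDDEN))  # plays the role of latent_embedding

# Item 1: latent-variable logits from the mask token's final hidden state.
post_w, post_b = dense(HIDDEN, NUM_LATENT)  # assumed form of post_network
rec_w, rec_b = dense(HIDDEN, HIDDEN)        # assumed form of recognition_fc
recognition_bias = np.zeros(NUM_LATENT)     # plays the role of recognition_bias


def plato1_latent_logits(h_mask):
    # PLATO-1 as described: hidden state -> post_network.
    return h_mask @ post_w + post_b


def plato2_latent_logits(h_mask):
    # PLATO-2 as described: recognition_fc, then the latent embedding
    # matrix is reused as the output projection, plus recognition_bias.
    feat = h_mask @ rec_w + rec_b
    return feat @ latent_emb.T + recognition_bias


# Items 2 and 3: vocabulary logits for the NLL loss (the BOW head is analogous).
trans_w, trans_b = dense(HIDDEN, HIDDEN)  # mask_lm_trans_fc (or bow_trans_fc)
out_bias = np.zeros(VOCAB)                # mask_lm_out_fc.b_0 (or bow_out_fc.b_0)


def plato1_token_logits(h):
    # PLATO-1 as described: project directly with the tied word embedding.
    return h @ word_emb.T


def plato2_token_logits(h):
    # PLATO-2 as described: transform layer + layer norm, then the tied
    # word embedding plus an extra output bias.
    t = layer_norm(h @ trans_w + trans_b)
    return t @ word_emb.T + out_bias


h_mask = rng.normal(size=(2, HIDDEN))     # batch of mask-token hidden states
h_resp = rng.normal(size=(2, 4, HIDDEN))  # batch x response length x hidden

print(plato1_latent_logits(h_mask).shape, plato2_latent_logits(h_mask).shape)
print(plato1_token_logits(h_resp).shape, plato2_token_logits(h_resp).shape)
```

In this sketch both variants produce logits of the same shape. The PLATO-2 side adds a transform, a layer norm, and a bias before or after the shared matrix, and in item 1 it reuses latent_embedding in place of a separate post_network weight.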

I'm not sure whether my understanding above is correct. Also, what is the motivation behind these design changes?

Jun 09 '22 14:06 kiseliu

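These network changes make little difference to model quality. They are mainly there to align with the BERT model structure and to share more parameters.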

Jun 10 '22 13:06 sserdoubleh