
Loss functions entropy_loss and bow_loss

Open guijuzhejiang opened this issue 3 years ago • 9 comments

What exactly do entropy_loss and bow_loss measure? During my training, entropy_loss becomes nan, and bow_loss stops decreasing once it drops to around 5. Is this normal? In what situations should these two losses be used, i.e. when should use_entropy and use_bow be set to true?

guijuzhejiang avatar Jan 05 '22 10:01 guijuzhejiang

Is entropy_loss the loss on the predicted latent output, and bow_loss the loss on the response sentence?

guijuzhejiang avatar Jan 05 '22 11:01 guijuzhejiang

entropy_loss is there to keep the output distribution over the individual latents fairly even, so that no single latent ends up with an excessively large probability (it becoming nan is rather strange; please post a log so we can see what is going on). bow_loss is there to improve the latents' selection and expressive ability: a latent with a low bow_loss is better suited as a hint to assist generating the current response. Because bow_loss fits a single probability distribution against multiple one-hot distributions (multiple tokens, one one-hot distribution per token), its lower bound is not 0, so plateauing around 5 is in line with expectations.
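The two losses described above can be sketched numerically. This is a minimal NumPy illustration, not Knover's actual implementation; the function names, the number of latents K, and the vocabulary/target sizes are assumptions for the example:

```python
import numpy as np

def entropy_loss(post_probs):
    """Negative entropy of the latent posterior.

    Minimizing this pushes the posterior toward uniform, so no single
    latent grabs all the probability mass. The epsilon guards against
    0 * log(0) = nan if the posterior collapses onto one latent.
    """
    return float(np.sum(post_probs * np.log(post_probs + 1e-12)))

def bow_loss(bow_probs, target_ids):
    """Bag-of-words loss: a single predicted vocabulary distribution
    is scored against every token of the target response (each token
    is a separate one-hot target), averaged over tokens.
    """
    return float(-np.mean(np.log(bow_probs[target_ids] + 1e-12)))

# Uniform posterior over K latents gives entropy_loss = -log(K),
# the minimum; a collapsed posterior gives a value near 0.
K = 20
uniform = np.full(K, 1.0 / K)
peaked = np.full(K, 1e-6)
peaked[0] = 1.0 - (K - 1) * 1e-6

# With T distinct target tokens, the best any single distribution can
# do is put 1/T on each of them, so bow_loss is bounded below by
# log(T) > 0 -- which is why a plateau around 5 is not surprising.
V, T = 30000, 20
target_ids = np.arange(T)
best_bow = np.zeros(V)
best_bow[target_ids] = 1.0 / T
```

For K = 20 latents, `entropy_loss(uniform)` is about -3.0 (= -log 20), and `bow_loss(best_bow, target_ids)` bottoms out at log T, illustrating the non-zero lower bound mentioned above.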

sserdoubleh avatar Jan 05 '22 14:01 sserdoubleh

This is a log snippet from the PLATO training stage (2.1). The other losses look fine to me; I just don't know why entropy_loss is nan.
[train][62] progress: 2/3 step: 608780, time: 7.482, queue size: 64, speed: 2.673 steps/s current lr: 0.0000011 lm_loss: 0.0430, ppl: 1.0439, loss: 4.0380, bow_loss: 3.9950, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608800, time: 7.493, queue size: 64, speed: 2.669 steps/s current lr: 0.0000011 lm_loss: 0.0851, ppl: 1.0888, loss: 4.0096, bow_loss: 3.9245, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608820, time: 7.457, queue size: 64, speed: 2.682 steps/s current lr: 0.0000011 lm_loss: 0.0438, ppl: 1.0447, loss: 3.9613, bow_loss: 3.9176, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608840, time: 7.465, queue size: 64, speed: 2.679 steps/s current lr: 0.0000011 lm_loss: 0.0430, ppl: 1.0439, loss: 4.2938, bow_loss: 4.2508, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608860, time: 7.476, queue size: 64, speed: 2.675 steps/s current lr: 0.0000011 lm_loss: 0.0503, ppl: 1.0516, loss: 4.1240, bow_loss: 4.0737, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608880, time: 7.507, queue size: 64, speed: 2.664 steps/s current lr: 0.0000011 lm_loss: 0.1036, ppl: 1.1092, loss: 4.5535, bow_loss: 4.4498, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608900, time: 7.468, queue size: 64, speed: 2.678 steps/s current lr: 0.0000011 lm_loss: 0.1576, ppl: 1.1707, loss: 4.5963, bow_loss: 4.4388, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608920, time: 7.462, queue size: 64, speed: 2.680 steps/s current lr: 0.0000011 lm_loss: 0.0510, ppl: 1.0523, loss: 4.0235, bow_loss: 3.9726, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608940, time: 7.456, queue size: 64, speed: 2.683 steps/s current lr: 0.0000011 lm_loss: 0.0622, ppl: 1.0641, loss: 4.0297, bow_loss: 3.9676, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608960, time: 7.463, queue size: 64, speed: 2.680 steps/s current lr: 0.0000011 lm_loss: 0.0514, ppl: 1.0528, loss: 4.1006, bow_loss: 4.0492, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 608980, time: 7.444, queue size: 64, speed: 2.687 steps/s current lr: 0.0000011 lm_loss: 0.0640, ppl: 1.0661, loss: 4.0933, bow_loss: 4.0293, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 609000, time: 7.484, queue size: 64, speed: 2.672 steps/s current lr: 0.0000011 lm_loss: 0.1101, ppl: 1.1164, loss: 4.3136, bow_loss: 4.2035, entropy_loss: nan, loss_scaling: 157281.4219
[train][62] progress: 2/3 step: 609020, time: 7.477, queue size: 64, speed: 2.675 steps/s current lr: 0.0000011 lm_loss: 0.0561, ppl: 1.0577, loss: 3.8742, bow_loss: 3.8181, entropy_loss: nan, loss_scaling: 157281.4219

guijuzhejiang avatar Jan 06 '22 00:01 guijuzhejiang

Thank you very much for the explanation of entropy_loss and bow_loss. entropy_loss is normal at first, then gradually becomes nan. I set latent=10.
[train][2] progress: 2/3 step: 17020, time: 7.476, queue size: 64, speed: 2.675 steps/s current lr: 0.0000069 lm_loss: 0.3731, ppl: 1.4522, loss: 5.6612, bow_loss: 5.2881, entropy_loss: -0.4200, loss_scaling: 419430.4062
[train][2] progress: 2/3 step: 17040, time: 7.443, queue size: 64, speed: 2.687 steps/s current lr: 0.0000069 lm_loss: 0.3204, ppl: 1.3777, loss: 6.1795, bow_loss: 5.8591, entropy_loss: -0.4292, loss_scaling: 419430.4062
[train][2] progress: 2/3 step: 17060, time: 7.459, queue size: 64, speed: 2.681 steps/s current lr: 0.0000068 lm_loss: 0.3063, ppl: 1.3584, loss: 6.3760, bow_loss: 6.0697, entropy_loss: nan, loss_scaling: 419430.4062
[train][2] progress: 2/3 step: 17080, time: 7.497, queue size: 64, speed: 2.668 steps/s current lr: 0.0000068 lm_loss: 0.2856, ppl: 1.3306, loss: 5.7847, bow_loss: 5.4990, entropy_loss: -0.3148, loss_scaling: 419430.4062
[train][2] progress: 2/3 step: 17100, time: 7.446, queue size: 64, speed: 2.686 steps/s current lr: 0.0000068 lm_loss: 0.6854, ppl: 1.9845, loss: 6.5226, bow_loss: 5.8373, entropy_loss: -0.4635, loss_scaling: 419430.4062
[train][2] progress: 2/3 step: 17120, time: 7.496, queue size: 64, speed: 2.668 steps/s

guijuzhejiang avatar Jan 06 '22 02:01 guijuzhejiang

> entropy_loss is normal at first, then gradually becomes nan. I set latent=10.

That is somewhat abnormal. Did you set use_entropy to true? I suspect it may be caused by the distribution being extremely concentrated.

You can try adding a line above https://github.com/PaddlePaddle/Knover/blob/9d0db786dca9c575b40eb5776c6620bbd6657070/knover/models/plato.py#L276:

layers.Print(outputs["post_probs"], message="post_probs", summarize=-1)
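For intuition on why a collapsed posterior could surface as nan: in floating-point arithmetic, a latent that receives exactly zero probability turns the p·log p term of the entropy into 0 · (-inf) = nan. A minimal NumPy sketch (an illustration of the failure mode, not Knover code):

```python
import numpy as np

# A posterior that has collapsed entirely onto the first latent.
p = np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32)

# Suppress the expected log(0) and 0 * -inf warnings.
with np.errstate(divide="ignore", invalid="ignore"):
    neg_entropy = np.sum(p * np.log(p))  # 0 * log(0) = 0 * -inf -> nan

print(neg_entropy)  # nan
```

Printing `post_probs` as suggested above would show whether the posterior is actually hitting such hard zeros during training.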

sserdoubleh avatar Jan 07 '22 04:01 sserdoubleh

Thanks for the reply. By "extremely concentrated distribution" do you mean the latent distribution? Where would that show up in the data? I haven't set use_entropy=true yet; I'll try setting it to true.

guijuzhejiang avatar Jan 08 '22 03:01 guijuzhejiang

Thank you for the guidance. After setting use_entropy=true, entropy_loss is indeed normal now; later in training it drops to -0.0001.
[train][29] progress: 1/3 step: 218140, time: 7.781, queue size: 64, speed: 2.570 steps/s current lr: 0.0000019 lm_loss: 0.0137, ppl: 1.0138, loss: 4.9016, bow_loss: 4.8879, entropy_loss: -0.0001, loss_scaling: 313855.4688
[train][29] progress: 1/3 step: 218160, time: 7.810, queue size: 64, speed: 2.561 steps/s current lr: 0.0000019 lm_loss: 0.0174, ppl: 1.0175, loss: 4.7110, bow_loss: 4.6936, entropy_loss: -0.0001, loss_scaling: 313855.4688
[train][29] progress: 1/3 step: 218180, time: 7.768, queue size: 64, speed: 2.575 steps/s current lr: 0.0000019 lm_loss: 0.0243, ppl: 1.0246, loss: 5.1143, bow_loss: 5.0901, entropy_loss: -0.0001, loss_scaling: 313855.4688
[train][29] progress: 1/3 step: 218200, time: 7.783, queue size: 64, speed: 2.570 steps/s current lr: 0.0000019 lm_loss: 0.0274, ppl: 1.0278, loss: 4.7470, bow_loss: 4.7196, entropy_loss: -0.0001, loss_scaling: 313855.4688
[train][29] progress: 1/3 step: 218220, time: 7.779, queue size: 64, speed: 2.571 steps/s current lr: 0.0000019 lm_loss: 0.4484, ppl: 1.5658, loss: 5.6762, bow_loss: 5.2278, entropy_loss: -0.0001, loss_scaling: 313855.4688
[train][29] progress: 1/3 step: 218240, time: 7.783, queue size: 64, speed: 2.570 steps/s current lr: 0.0000019 lm_loss: 0.0222, ppl: 1.0225, loss: 4.2704, bow_loss: 4.2482, entropy_loss: -0.0001, loss_scaling: 313855.4688

guijuzhejiang avatar Jan 09 '22 06:01 guijuzhejiang

> By "extremely concentrated distribution" do you mean the latent distribution? Where would that show up in the data?

Yes, I mean the distribution over latents.

> After setting use_entropy=true, entropy_loss is indeed normal now; later in training it drops to -0.0001.

What was the entropy loss here at the very beginning?

sserdoubleh avatar Jan 11 '22 03:01 sserdoubleh

The entropy loss is around -2.9 at the very beginning.
[train][1] progress: 2/3 step: 80, time: 7.979, queue size: 64, speed: 2.507 steps/s current lr: 0.0000008 lm_loss: 0.5734, ppl: 1.7743, loss: 10.2616, bow_loss: 9.6882, entropy_loss: -2.9593, loss_scaling: 32768.0000
[train][1] progress: 2/3 step: 100, time: 7.944, queue size: 64, speed: 2.518 steps/s current lr: 0.0000010 lm_loss: 0.0725, ppl: 1.0751, loss: 9.0786, bow_loss: 9.0061, entropy_loss: -2.9458, loss_scaling: 32768.0000
[train][1] progress: 2/3 step: 120, time: 8.002, queue size: 64, speed: 2.499 steps/s current lr: 0.0000012 lm_loss: 0.0766, ppl: 1.0797, loss: 8.4782, bow_loss: 8.4016, entropy_loss: -2.9511, loss_scaling: 32768.0000

guijuzhejiang avatar Jan 12 '22 08:01 guijuzhejiang