seq2seq-summarizer
About rl_loss
Thank you for this excellent work. I still have a question about rl_loss. It is computed as rl_loss = neg_reward * sample_out.loss, where neg_reward is greedy_rouge - sample_rouge and sample_out.loss is the cross-entropy loss, i.e. -LogP(). However, the self-critical policy gradient training algorithm in the paper uses LogP(), which confuses me. Could you please explain this?
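For reference, here is a minimal sketch (not the repository's actual code) of how the signs work out in self-critical training. The helper name `self_critical_loss` and the toy tensors are my own assumptions, purely for illustration:

```python
import torch

def self_critical_loss(sample_nll, sample_rouge, greedy_rouge):
    """Sketch of a self-critical policy-gradient loss.

    The paper maximizes (r_sample - r_greedy) * LogP(sample), i.e. it does
    gradient *ascent* on that objective. A framework minimizes a loss, so
    the objective is negated:
        loss = -(sample_rouge - greedy_rouge) * LogP(sample)
             = (greedy_rouge - sample_rouge) * (-LogP(sample))
             = neg_reward * sample_nll
    sample_nll is the cross-entropy (negative log-likelihood) of the sampled
    sequence, so multiplying it by neg_reward gives exactly the quantity to
    minimize -- the two minus signs cancel.
    """
    neg_reward = greedy_rouge - sample_rouge          # (batch,) reward difference
    return (neg_reward * sample_nll).mean()

# toy usage
sample_nll = torch.tensor([2.3, 1.7])       # -LogP of sampled summaries
sample_rouge = torch.tensor([0.25, 0.40])   # ROUGE of sampled summaries
greedy_rouge = torch.tensor([0.30, 0.35])   # ROUGE of greedy baseline
print(self_critical_loss(sample_nll, sample_rouge, greedy_rouge))
```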
Update
I have read the code from SeqGAN. Following policy gradient, the loss is computed as loss += -out[j][target.data[i][j]] * reward[j], where out is the log_softmax output, so the author adds the "-" in order to use gradient descent afterwards.
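Below is a hedged PyTorch sketch of that token-level policy-gradient loss; the function name `pg_loss` and the variable names are my own, not the SeqGAN code's:

```python
import torch
import torch.nn.functional as F

def pg_loss(logits, targets, rewards):
    """SeqGAN-style token-level policy-gradient loss.

    logits:  (batch, seq_len, vocab) raw decoder outputs
    targets: (batch, seq_len) sampled token ids
    rewards: (batch, seq_len) per-token rewards

    Implements loss += -log_softmax(out)[target] * reward; the minus sign
    turns "maximize reward-weighted log-probability" into a quantity that
    can be minimized with ordinary gradient descent.
    """
    log_probs = F.log_softmax(logits, dim=-1)                         # LogP over vocab
    picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # LogP of chosen tokens
    return (-picked * rewards).sum()

# toy usage
logits = torch.randn(2, 5, 10, requires_grad=True)
targets = torch.randint(0, 10, (2, 5))
rewards = torch.rand(2, 5)
loss = pg_loss(logits, targets, rewards)
loss.backward()
```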