fast_abs_rl
A question about the RL training function
```python
for action, p, r, b in zip(indices, probs, reward, baseline):
    advantage = r - b
    avg_advantage += advantage
    losses.append(-p.log_prob(action) * (advantage / len(indices)))  # divide by T*B
```
I have a question about this piece of code. If I understand it correctly, the variable `b` here is a tensor with gradients enabled, so optimizing the tensors in `losses` will both improve the reward by updating the policy weights and shrink the advantage by pushing the baseline up. I can't understand why the baseline is optimized through this loss, because as far as I know the baseline should only be updated when training the critic. I actually used this training function on a different summarization task, and I found that `avg_advantage` keeps dropping. Thank you very much.
I changed `r - b` to `(r - b).item()`, and it seems to work correctly now.
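For non-scalar tensors, `.detach()` achieves the same effect as `.item()` (which only works on single-element tensors) by cutting the gradient path from the policy loss back to the baseline. Here is a minimal sketch with hypothetical tensors (not the repo's actual variables) showing that a detached advantage still trains the policy but leaves the critic untouched:

```python
import torch

torch.manual_seed(0)

# Stand-ins for the policy output and the critic's value estimate.
policy_logits = torch.randn(3, 5, requires_grad=True)
baseline = torch.randn(3, requires_grad=True)
reward = torch.tensor([1.0, 0.5, -0.2])
actions = torch.tensor([0, 2, 4])

dist = torch.distributions.Categorical(logits=policy_logits)
log_probs = dist.log_prob(actions)

# Detach so the advantage is treated as a constant w.r.t. the baseline:
advantage = (reward - baseline).detach()
policy_loss = (-log_probs * advantage).mean()
policy_loss.backward()

assert policy_logits.grad is not None  # the policy still receives gradients
assert baseline.grad is None           # no gradient flows into the baseline
```

The baseline itself would then be trained separately (e.g. with an MSE loss against the observed reward), which is the usual actor-critic split.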
Thanks for pointing this out! I think your solution should work as intended. I will test how this affects the results when I have time.