fast_abs_rl
A question about the RL training function
```python
for action, p, r, b in zip(indices, probs, reward, baseline):
    advantage = r - b
    avg_advantage += advantage
    losses.append(-p.log_prob(action) * (advantage / len(indices)))  # divide by T*B
```
I have a question about this piece of code. If I understand it correctly, the variable `b` here is a tensor with gradients enabled, so optimizing the tensors in `losses` will both improve the reward by updating the policy weights and shrink the advantage by pushing the baseline up. I can't understand why the baseline is optimized through this loss, because as far as I know the baseline should only be updated when training the critic. I actually used this training function on a different summarization task, and I found that `avg_advantage` keeps dropping. Thank you very much.
I changed `r - b` to `(r - b).item()`, and it seems to work correctly now.
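For non-scalar tensors, `.detach()` achieves the same effect as `.item()` (which only works on single-element tensors) by cutting the gradient path from the policy loss back to the baseline. Here is a minimal sketch with hypothetical tensors (not the repo's actual variables) showing that a detached advantage still trains the policy but leaves the critic untouched:

```python
import torch

torch.manual_seed(0)

# Stand-ins for the policy output and the critic's value estimate.
policy_logits = torch.randn(3, 5, requires_grad=True)
baseline = torch.randn(3, requires_grad=True)
reward = torch.tensor([1.0, 0.5, -0.2])
actions = torch.tensor([0, 2, 4])

dist = torch.distributions.Categorical(logits=policy_logits)
log_probs = dist.log_prob(actions)

# Detach so the advantage is treated as a constant w.r.t. the baseline:
advantage = (reward - baseline).detach()
policy_loss = (-log_probs * advantage).mean()
policy_loss.backward()

assert policy_logits.grad is not None  # the policy still receives gradients
assert baseline.grad is None           # no gradient flows into the baseline
```

The baseline itself would then be trained separately (e.g. with an MSE loss against the observed reward), which is the usual actor-critic split.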
Thanks for pointing this out! I think your solution should work as intended. I will test how this affects the results when I have time.