LOLA Policy Gradient Target Computation

Open dkkim93 opened this issue 5 years ago • 0 comments

Hello, thank you for open-sourcing the code! :-) It is really helpful for understanding the papers more deeply.

I am interested in LOLA, especially its policy gradient method (lola/train_pg.py). As mentioned in the paper, this implementation uses an actor-critic method.

However, I could not fully understand the target computation code: self.target = self.sample_return + self.next_v (code). According to the reference (chapter 13, page 274, one-step actor-critic pseudocode), the one-step target is the reward at timestep t plus the discounted value of the next state. I therefore wonder whether the target computation should use the step reward (i.e., the reward at timestep t) instead of the return.
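To make the question concrete, here is a minimal sketch of the two targets I am comparing. It is not the repository's code: the function names, the example numbers, and the discount factor gamma are hypothetical, and I am assuming that sample_return holds a discounted return from timestep t onward, which I have not verified.

```python
# Hypothetical illustration only; this is not the code in lola/train_pg.py.

def one_step_td_targets(rewards, next_values, gamma):
    """One-step actor-critic target: r_t + gamma * V(s_{t+1})."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


def discounted_returns(rewards, gamma):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1}, accumulated backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Made-up three-step episode; the last next-state value is 0 (terminal).
rewards = [1.0, 0.0, 2.0]
next_values = [0.5, 1.0, 0.0]
gamma = 0.5

# Target I expected from the one-step actor-critic pseudocode.
td = one_step_td_targets(rewards, next_values, gamma)  # [1.25, 0.5, 2.0]

# Return-based quantity (ASSUMPTION: this is what sample_return holds).
mc = discounted_returns(rewards, gamma)  # [1.5, 1.0, 2.0]

# Same form as the line in question under that assumption: return + next value.
mixed = [g + v for g, v in zip(mc, next_values)]  # [2.0, 2.0, 2.0]
```

In this toy episode the two targets differ at every non-terminal step, which is the discrepancy I am asking about.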

Thank you for your time and consideration!

Oct 21 '19 16:10 dkkim93
