pytorch-rl
Taking 'done' into consideration while calculating returns
Hello, thank you for making this repo. I think that while calculating the returns you should take done into consideration, like this:
def calculate_returns(self, rewards, dones, normalize = True):
    returns = []
    R = 0
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            # reset the running return at episode boundaries so rewards
            # from one episode do not leak into the returns of another
            R = 0
        R = r + R * self.gamma
        returns.insert(0, R)
    returns = torch.tensor(returns).to(device)
    if normalize:
        returns = (returns - returns.mean()) / returns.std()
    return returns
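For illustration, here is a standalone sketch of the same calculation (written as a free function with an arbitrary gamma of 0.99, so not the repo's actual method), showing how the reset changes the returns when rewards from two episodes are concatenated:

import torch

def calculate_returns_standalone(rewards, dones, gamma=0.99):
    # same logic as above, written as a free function purely for illustration
    returns = []
    R = 0
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            R = 0  # reset at episode boundaries
        R = r + R * gamma
        returns.insert(0, R)
    return torch.tensor(returns)

# two concatenated episodes of length 2, all rewards equal to 1
rewards = [1.0, 1.0, 1.0, 1.0]
dones   = [False, True, False, True]
print(calculate_returns_standalone(rewards, dones))
# tensor([1.9900, 1.0000, 1.9900, 1.0000])
# without the reset the first entry would be 1 + 0.99 + 0.99**2 + 0.99**3 ≈ 3.94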
Also, could you please briefly describe Generalized Advantage Estimation (GAE) and how it is used when calculating the advantages?
Notebooks 1-7 all use Monte Carlo methods. That is, each environment is run for a single episode, i.e. until the environment returns done = True, after which we calculate the returns/advantages and update the policy parameters.
There is no need to check for done in the calculation of the returns/advantages, as only the last state will have done = True, which is why R is initialized to zero.
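To make that concrete, here is a small check (using an arbitrary gamma of 0.99, not necessarily the notebooks' value): when only the final transition has done = True, the reset fires while R is still zero, so the done-aware version and the plain version produce identical returns.

import torch

gamma = 0.99

def returns_plain(rewards):
    returns, R = [], 0
    for r in reversed(rewards):
        R = r + R * gamma
        returns.insert(0, R)
    return torch.tensor(returns)

def returns_with_done(rewards, dones):
    returns, R = [], 0
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            R = 0  # a no-op here, since R is still 0 at the last step
        R = r + R * gamma
        returns.insert(0, R)
    return torch.tensor(returns)

# a single episode: only the last transition is terminal
rewards = [1.0, 0.0, 2.0]
dones   = [False, False, True]
print(torch.equal(returns_plain(rewards), returns_with_done(rewards, dones)))  # True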
I'll add an explanation of GAE when I get around to adding more detail to the notebooks - for now I'd recommend these two links, plus the rough sketch below them:
- https://datascience.stackexchange.com/questions/32480/how-does-generalised-advantage-estimation-work
- https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
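In the meantime, here is a rough sketch of how GAE is typically computed (illustrative code with arbitrary gamma/lambda values, not the implementation that will go into the notebooks): the advantage is an exponentially weighted sum of one-step TD errors, where lambda = 1 recovers the Monte Carlo advantage and lambda = 0 recovers the one-step TD advantage.

import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T; values: length T + 1 (includes a bootstrap
    # value for the state reached after the final step)
    advantages = []
    A = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]  # one-step TD error
        A = delta + gamma * lam * mask * A  # discounted sum of TD errors
        advantages.insert(0, A)
    return torch.tensor(advantages)

# tiny example with made-up value estimates
rewards = [1.0, 1.0, 1.0]
values  = [0.5, 0.6, 0.7, 0.0]  # V(s_0), V(s_1), V(s_2) plus bootstrap V(s_3)
dones   = [False, False, True]
print(gae_advantages(rewards, values, dones))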