higgsfield
Error in 1.actor-critic.ipynb
Through debugging I found an interesting bug in your code. Let's set the number of environments to 1 and the number of steps to 32. The environment can then terminate one or more times within those 32 steps. In that case, the function compute_returns() will return a wrong return, because the masks cause the initial estimate of next_value to be lost...
When you use SubprocVecEnv as a wrapper to have multiple actors in parallel, the environments automatically call the reset() method after the end of an episode.
I think compute_returns() is correct because it goes backward:
    for step in reversed(range(len(rewards))):
        R = rewards[step] + gamma * R * masks[step]
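If the environment ends at a given step x with 0 <= x < 32 - 1, then masks[x] is 0, so at this step R is equal to rewards[x]. That is good: the episode has finished, so we have a complete roll-out and we don't need the initial estimate of next_value anymore for that step or the steps before it. The steps after x belong to the next episode and still bootstrap from next_value.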
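To make the argument concrete, here is a minimal sketch that wraps the two quoted lines in a function and runs it on a 4-step rollout with one termination in the middle. The function signature, the initialisation `R = next_value`, and all numbers (`gamma = 0.5`, `next_value = 10.0`, the rewards and masks) are assumptions chosen for easy arithmetic, not values taken from the notebook.

```python
def compute_returns(next_value, rewards, masks, gamma=0.99):
    """Discounted returns computed backward, as in the loop quoted above.

    Assumption: masks[step] is 0 when the episode ended at that step, else 1.
    """
    R = next_value  # bootstrap estimate for the state after the last step
    returns = []
    for step in reversed(range(len(rewards))):
        # A mask of 0 drops everything accumulated from later steps.
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns


# Hypothetical rollout: the episode ends at step 1, then the env resets.
rewards = [1.0, 1.0, 1.0, 1.0]
masks = [1.0, 0.0, 1.0, 1.0]
returns = compute_returns(next_value=10.0, rewards=rewards, masks=masks, gamma=0.5)
print(returns)  # [1.5, 1.0, 4.0, 6.0]
```

Steps 2 and 3 belong to the new episode and still use next_value (6.0 = 1 + 0.5 * 10), while steps 0 and 1 belong to the finished episode and only see its real rewards. Under these assumptions, the estimate is dropped only where it should not apply.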