GA3C
Cannot learn problems with a single, terminal reward
Thank you for the fast and easy-to-use A3C implementation. I created a simple problem for rapid testing that rewards 0 on every step except the terminal step, where it rewards either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:
terminal_reward = 0 if done else value
which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:
return experiences[:-1]
which causes the agent to ignore the only meaningful experience in this environment.
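For concreteness, a minimal environment of this shape looks roughly like the sketch below (a simplified, hypothetical stand-in for my test problem, not the exact code I used):

```python
import random

# Simplified stand-in for the toy problem: every step returns reward 0, and
# only the terminal step returns -1 or +1 depending on the agent's last action.
class SparseTerminalEnv:
    def __init__(self, episode_length=8):
        self.episode_length = episode_length
        self.target = None
        self.t = 0

    def reset(self):
        self.t = 0
        self.target = random.randint(0, 1)  # observable part of the state
        return (self.t, self.target)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        # All intermediate rewards are 0; only the terminal step is informative.
        reward = (1 if action == self.target else -1) if done else 0
        return (self.t, self.target), reward, done
```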
Both issues are easily fixed by changing line 107 in ProcessAgent.py to
terminal_reward = reward if done else value
and _accumulate_rewards() in ProcessAgent.py to return all experiences when the agent has taken a terminal step. These changes should generally increase performance, since terminal steps often contain valuable reward signal.
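A self-contained sketch of both changes (the Experience tuple and the helper below are simplified stand-ins for the ones in ProcessAgent.py, so the exact names and signature differ):

```python
from collections import namedtuple

# Simplified stand-in for GA3C's experience record.
Experience = namedtuple('Experience', ['state', 'action', 'reward'])

def accumulate_rewards(experiences, discount_factor, terminal_reward, done):
    # With the fix, terminal_reward is the reward of the terminal step when
    # done, and the bootstrap value V(s_last) otherwise, i.e.
    # `terminal_reward = reward if done else value`.
    reward_sum = terminal_reward
    results = list(experiences)
    for t in reversed(range(len(results) - 1)):
        reward_sum = discount_factor * reward_sum + results[t].reward
        results[t] = results[t]._replace(reward=reward_sum)
    # Keep the terminal experience when the episode actually ended, so its
    # reward is trained on; otherwise drop the last entry, which only serves
    # to provide the bootstrap value for the next batch.
    return results if done else results[:-1]

# Toy episode: zero reward everywhere except the terminal step.
episode = [Experience(s, 0, 0.0) for s in range(4)] + [Experience(4, 0, 1.0)]
returns = accumulate_rewards(episode, 0.99, terminal_reward=1.0, done=True)
print([round(e.reward, 4) for e in returns])
# -> [0.9606, 0.9703, 0.9801, 0.99, 1.0]
```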
Hi, William! I think you are right, but have you validated this with any experiments?
On my toy problem, which has a nonzero reward only on the terminal step, the agent cannot learn without this change. I haven't tested it on more complex problems like Atari games (I imagine the impact would be smaller there, since those games usually provide many rewards on non-terminal steps).
Hi, thanks for noticing this. Our implementation of A3C is indeed consistent with the original algorithm (see https://arxiv.org/pdf/1602.01783.pdf, page 14, where the reward is set to 0 for a terminal state). My intuition is that this is done because the expected value of the final state can only be zero (no rewards are expected in the future). Nonetheless, your fix should allow the algorithm to be used for a game with only one, final reward. That said, I am not sure A3C is the best algorithm for this case: you may have to change some of its hyper-parameters dramatically (t_max, for instance) to see convergence, and in any case I do not expect convergence to be fast. This also obviously depends on the length of the episodes in your toy game.
A3C correctly sets the value of terminal states to 0, but it keeps the reward those terminal states give ("R ← r_i + γR" in the A3C pseudocode). GA3C sets both the reward and the value of the terminal state to 0. In Pong, for example, where the terminal state also carries a reward of -1 or 1 (being scored on or scoring), this means no useful learning happens from the experiences of the last round of play.
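To make the distinction concrete, here is a small illustration of the backup from the paper's pseudocode (the function and its signature are mine, not taken from either codebase): the return R is seeded with 0 for a terminal state, or with the bootstrap value V(s_t) when the rollout is cut off early, and the terminal step's own reward is still folded in by "R ← r_i + γR".

```python
def n_step_returns(rewards, gamma, done, bootstrap_value):
    """Compute n-step returns as in the A3C pseudocode (illustrative sketch)."""
    # Terminal state: no future value, so R starts at 0.
    # Cut-off rollout: R starts at the critic's estimate V(s_t).
    R = 0.0 if done else bootstrap_value
    returns = []
    for r in reversed(rewards):   # i = t-1, ..., t_start
        R = r + gamma * R         # R <- r_i + gamma * R  (terminal reward is kept)
        returns.append(R)
    return list(reversed(returns))

# Pong-style episode end: only the last step carries the -1/+1 score.
print([round(R, 4) for R in n_step_returns([0, 0, 0, 1], gamma=0.99,
                                            done=True, bootstrap_value=0.0)])
# -> [0.9703, 0.9801, 0.99, 1.0]; the terminal reward propagates to every step.
```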