
To be, or not to be A2C

Open riccardodv opened this issue 3 years ago • 4 comments

  1. I have doubts about our implementation of A2C. To me, it seems that what is actually implemented is REINFORCE with baseline (according to the nomenclature in Sutton & Barto). In fact, we collect discounted rewards in https://github.com/rlberry-py/rlberry/blob/014fcd38b13d09abd61ed55ea6bbd357c25a33d7/rlberry/agents/torch/a2c/a2c.py#L242-L253 and then we compute the loss as in https://github.com/rlberry-py/rlberry/blob/014fcd38b13d09abd61ed55ea6bbd357c25a33d7/rlberry/agents/torch/a2c/a2c.py#L268-L277 by subtracting the value function from the discounted rewards to form the advantages (a minimal sketch of this computation follows the list). To my knowledge this is very similar to REINFORCE, the difference being the baseline provided by the value function. Apart from that, the two implementations look almost the same. I might be missing something, but I thought A2C should use some sort of TD estimate of the advantages instead; is that correct?

  2. I say that they look almost the same because, in the current implementation of A2C, we also normalize the rewards and the advantages. These two procedures do not seem to be part of the standard implementation. Would you suggest that we keep them as defaults for A2C, or should we do something more canonical, like what is already done in REINFORCE, where normalizing rewards is optional? Do you also think it is sound to normalize advantages? https://github.com/rlberry-py/rlberry/blob/014fcd38b13d09abd61ed55ea6bbd357c25a33d7/rlberry/agents/torch/a2c/a2c.py#L254-L256 https://github.com/rlberry-py/rlberry/blob/014fcd38b13d09abd61ed55ea6bbd357c25a33d7/rlberry/agents/torch/a2c/a2c.py#L268-L270

  3. If I am correct, I was thinking of giving REINFORCE the option of using a baseline or not, and of changing the A2C code to use a TD estimate of the advantages.
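
For concreteness, here is a minimal sketch of the estimate described in point 1: full-episode discounted returns, with the critic used only as a baseline. The names (`rewards`, `values`) are illustrative, not the actual rlberry variables:

```python
import torch

def mc_returns_and_advantages(rewards, values, gamma=0.99):
    """Sketch of a Monte-Carlo return with a value baseline (illustrative names).

    `rewards` is a list of per-step rewards for one episode and `values` is a
    tensor of the critic's estimates V(s_t) for the same timesteps.
    """
    # Full-episode discounted returns, accumulated backwards through the episode.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    # Advantage = return minus value baseline; the critic only reduces variance
    # here, it never bootstraps the return.
    advantages = returns - values.detach()
    return returns, advantages
```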

riccardodv avatar May 07 '22 14:05 riccardodv

Hello! Hopefully, I can answer these questions :)

  1. REINFORCE with Baseline can use any state-dependent baseline (you don't even have to use advantages, you can use the Q values directly - but advantages give lower variance estimates). A2C is a simplification of A3C, which was a basic asynchronous implementation of an actor-critic. In order to estimate the Q values for the advantages, they use discounted rewards on trajectory fragments up to a certain length (20 in the original paper) plus the value of the last state. Actually, the code you highlighted is very similar to Algorithm S2 in the original paper.
  2. Normalizing rewards is not common, but normalizing advantages is. I suggest that we leave only advantage normalization and add an option to toggle it.
  3. That's fine. You could look into Generalized Advantage Estimation (GAE), a method that combines TD-based advantage estimates over different horizons. It is one of the most widely used advantage estimators for actor-critics (see the sketch below, which also includes the normalization toggle from point 2).
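
For reference, a compact sketch of GAE on a trajectory fragment, bootstrapping from the value of the last state and with advantage normalization as a toggle. Names, defaults, and the missing termination masks are mine, not rlberry's API:

```python
import torch

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95,
                   normalize=True):
    """Sketch of Generalized Advantage Estimation on a trajectory fragment.

    Bootstraps from `last_value`, the critic's estimate for the state reached
    after the fragment, so full episodes are not required.  Termination
    (done) masks are omitted for brevity.
    """
    values = values.detach()          # critic targets should not backprop here
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors: lam -> 1 approaches the
        # Monte-Carlo estimate, lam = 0 gives the one-step TD estimate.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values     # regression targets for the critic
    if normalize:                     # optional, cf. point 2 above
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages
```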

mmcenta avatar May 08 '22 11:05 mmcenta

Thank you very much @mmcenta for the feedback and for the references!

riccardodv avatar May 08 '22 15:05 riccardodv

I also have another question: 4. Do you know what exploration bonus is currently implemented in A2C and REINFORCE? It is not very clear to me. I was also wondering about rlberry.wrappers.uncertainty_estimator_wrapper, which seems to be related, but it is not clear to me how, since that part of the library is not well documented, I think.

riccardodv avatar May 09 '22 08:05 riccardodv

I suggest that we remove all these exploration bonuses from the deep RL agents. I think the code implementing these bonuses is not stable enough, and it's better not to mix standard implementations of the agents with exploration strategies for now.

omardrwch avatar May 09 '22 20:05 omardrwch