Reinforcement-learning-with-tensorflow
PPO and Reward
Hello Zhou, I am confused about how the reward guides PPO in training the neural networks.
1. For example, when I feed a batch of data to the networks, I get a reward. Is that reward then used through the PPO gradient formula? (I mean that in the gradient formula we multiply by R.)
2. When I get a series of rewards, I want to know if my idea is right: I keep a running sum of the rewards, adding each new one as it arrives, then maximize that result with the PPO gradient formula, and that gives the best policy. Am I right?
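In case it helps make the question concrete: here is a minimal NumPy sketch (function names are my own for illustration, not from this repo) of the usual convention, where per-step rewards are first turned into discounted returns (rather than one raw running sum), and those returns (or advantages derived from them) then scale the PPO clipped surrogate objective.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A),
    # where r is the probability ratio pi_new / pi_old and A the advantage.
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

rewards = [1.0, 0.0, 2.0]
print(discounted_returns(rewards, gamma=0.9))  # -> [2.62 1.8  2.  ]
```

So the reward does multiply into the gradient, but per time step via the discounted return / advantage, and PPO additionally clips the update so a single large reward cannot push the policy too far in one step.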
Please excuse my poor English; I hope my description is clear.
Thank you for your advice!