async_deep_reinforce
why recalculate pi and v?
Hello, in game_ac_network.py, in def prepare_loss(self, entropy_beta), you have:
# temporary difference (R-V) (input for policy)
self.td = tf.placeholder("float", [None])
value_loss = 0.5 * tf.nn.l2_loss(self.r - self.v)
But td == self.r - self.v, right?
So why not use self.td directly instead of recomputing it from self.v? And for pi, why not pass it in as a placeholder as well?
Hoping for a reply, thanks.
Because self.td is a number (or batch of numbers) fed in from outside, it is a constant with respect to the network: using it in the policy gradient ensures the policy loss does not backpropagate through the value output. self.r - self.v, on the other hand, is used to compute the critic loss, which does need gradients to flow into V so the critic can learn. Feeding pi as a placeholder would not work for the same reason in reverse: the policy loss needs gradients to flow through pi.
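To make the distinction concrete, here is a toy sketch (not code from the repository) with a linear critic, worked out by hand instead of with TensorFlow. It shows that the value loss has a nonzero gradient through V, while td, being fed in as a constant, contributes no gradient through V in the policy loss:

```python
# Toy linear critic V(s) = w * s, illustrating why td is fed in as a
# constant (a placeholder) for the policy loss, while the value loss
# recomputes R - V so gradients can flow into the critic.
# All names here are illustrative, not from game_ac_network.py.

w = 0.5          # critic parameter
s = 2.0          # state feature
R = 3.0          # discounted return

V = w * s        # critic estimate, V = 1.0
td = R - V       # advantage, fed to the graph as a constant

# Value loss 0.5 * (R - V)^2: differentiating w.r.t. w gives
# -(R - V) * dV/dw = -(R - V) * s, so the critic is updated.
value_loss_grad_w = -(R - V) * s

# Policy loss -log(pi) * td: td is a fed-in constant, so its
# derivative w.r.t. w is zero -- no gradient leaks into the critic.
policy_td_grad_w = 0.0

print(value_loss_grad_w)   # -4.0
print(policy_td_grad_w)    # 0.0
```

If td were instead wired up as self.r - self.v inside the policy loss, the policy gradient would also push on the critic's weights, which is not what the actor-critic update calls for.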