trpo
About kl_firstfixed
Thanks for the implementation of TRPO. There are a few details that don't make sense to me yet.
I can't see why kl_firstfixed is defined as follows:
kl_firstfixed = tf.reduce_sum(tf.stop_gradient( action_dist_n) * tf.log(tf.stop_gradient(action_dist_n + eps) / (action_dist_n + eps))) / Nf
It seems we don't make use of oldaction_dist at all.
Shouldn't it be
kl_firstfixed = tf.reduce_sum(tf.stop_gradient( oldaction_dist) * tf.log(tf.stop_gradient(oldaction_dist + eps) / (action_dist_n + eps))) / Nf
?
Besides, why do the losses contain the entropy of action_dist_n? Why must it be minimized?
Sorry, I mean I think it should be
kl_firstfixed = tf.reduce_sum(tf.stop_gradient( oldaction_dist) * tf.log(tf.stop_gradient(oldaction_dist + eps) / (oldaction_dist + eps))) / Nf
All right, after a quick analysis, I think it's reasonable to use the first definition of kl_firstfixed. The quantity itself is identically zero, but only its second derivatives are needed, and at the current parameters its Hessian equals the Fisher information matrix, which is what the Fisher-vector product for conjugate gradient requires. Still, I'm confused about the losses: why do we try to minimize three values?
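For reference, here is a small numerical sketch of that analysis. It is plain numpy, not the repo's TensorFlow code, and it assumes a softmax-parametrized categorical distribution as a stand-in for action_dist_n: the "first-argument-fixed" KL (mirroring tf.stop_gradient(action_dist_n)) has zero value and zero gradient at the current parameters, yet its Hessian there is exactly the Fisher matrix diag(p) - p p^T.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_firstfixed(theta, theta0):
    # KL(p(theta0) || p(theta)) with the first argument held fixed,
    # mirroring tf.stop_gradient(action_dist_n) in the quoted code.
    p0, p = softmax(theta0), softmax(theta)
    return np.sum(p0 * np.log(p0 / p))

rng = np.random.default_rng(0)
theta0 = rng.normal(size=4)     # hypothetical current policy parameters
p0 = softmax(theta0)
h = 1e-5

# Finite-difference gradient at theta = theta0: it vanishes, because
# d/dtheta of -sum_i p0_i log p_i(theta) is -sum_i dp_i/dtheta = 0
# (probabilities always sum to 1).
grad = np.array([
    (kl_firstfixed(theta0 + h * e, theta0)
     - kl_firstfixed(theta0 - h * e, theta0)) / (2 * h)
    for e in np.eye(4)
])
assert np.allclose(grad, 0.0, atol=1e-6)

# Finite-difference Hessian at theta0: it matches the Fisher
# information matrix of the softmax, diag(p0) - p0 p0^T.
H = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        ei, ej = np.eye(4)[i] * h, np.eye(4)[j] * h
        H[i, j] = (kl_firstfixed(theta0 + ei + ej, theta0)
                   - kl_firstfixed(theta0 + ei - ej, theta0)
                   - kl_firstfixed(theta0 - ei + ej, theta0)
                   + kl_firstfixed(theta0 - ei - ej, theta0)) / (4 * h * h)
fisher = np.diag(p0) - np.outer(p0, p0)
assert np.allclose(H, fisher, atol=1e-4)
print("gradient at theta0 is zero; Hessian matches the Fisher matrix")
```

This also shows why the second suggested definition (oldaction_dist in both arguments) wouldn't work: it has no dependence on the trainable parameters at all, so every derivative of it is exactly zero.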