
Loss function/Labels for neural network used?

Open • abhigenie92 opened this issue on Jun 25, 2017 • 2 comments

I understand backpropagation in policy gradient networks, but I am not sure how your code works with Keras's auto-differentiation.

That is, I don't see how you transform it into a supervised learning problem. For example, in the code below:

Y = self.probs + self.learning_rate * np.squeeze(np.vstack([gradients]))

Why is Y not the one-hot vector for the action taken? You compute the gradient assuming the sampled action was correct (i.e. Y is a one-hot vector), then multiply it by the reward at the corresponding time-step, but during training you feed it in as a correction to the predicted probabilities. I think one could instead multiply the one-hot vector by the rewards and feed that in directly.

If possible, please clarify my doubt. :) https://github.com/keon/policy-gradient/blob/master/pg.py#L67

abhigenie92 · Jun 25 '17
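For readers not looking at pg.py, the construction the question refers to looks roughly like the sketch below (variable names and the toy numbers are illustrative, not copied from the repo): the per-step "gradient" is (one-hot action - probs) scaled by the discounted return, and Y nudges the network's own output in that direction.

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    # Discounted returns, normalized to zero mean / unit variance,
    # as is common in REINFORCE-style implementations.
    discounted = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        discounted[t] = running
    discounted -= discounted.mean()
    discounted /= (discounted.std() + 1e-8)
    return discounted

# Toy episode: 3 time-steps, 2 possible actions.
probs = np.array([[0.6, 0.4],      # softmax outputs the network produced
                  [0.3, 0.7],
                  [0.5, 0.5]])
actions = np.array([0, 1, 0])      # actions that were actually sampled
rewards = np.array([0.0, 0.0, 1.0])
learning_rate = 0.01

one_hot = np.eye(probs.shape[1])[actions]          # "pretend the sampled action was correct"
gradients = one_hot - probs                        # grad of log pi(a|s) w.r.t. the softmax logits
gradients *= discount_rewards(rewards)[:, None]    # weight each step by its return

# Pseudo-labels passed to model.train_on_batch(X, Y) with categorical crossentropy:
Y = probs + learning_rate * gradients

Because each row of gradients sums to zero, each row of Y still sums to 1, and the crossentropy gradient at the softmax output becomes probs - Y = -learning_rate * gradients. Backpropagation therefore applies exactly the reward-weighted REINFORCE signal, which is why Y is not a plain one-hot vector.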

opt = Adam(lr=self.learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=opt)

First, I think the loss gradient should be gradient = (y - prob) * reward. Second, we have already set the learning_rate in opt, so it should not appear again when building Y.

So shouldn't Y be self.probs + np.vstack([gradients])? Then Y - Y_predict = Y - self.probs = np.vstack([gradients]).

LinkToPast1990 · Mar 14 '19
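The "multiply the one-hot vector by the reward and feed it straight away" idea can also be expressed without any pseudo-label arithmetic. Below is a minimal sketch of that alternative (the network, shapes, and numbers are made up for illustration and are not from pg.py): feed the one-hot actions as targets and pass the discounted returns as Keras sample weights, so the minimized quantity is -return * log pi(action | state), the REINFORCE objective.

import numpy as np
from tensorflow import keras

num_actions = 2
state_dim = 4

# Illustrative policy network; not the architecture used in the repo.
model = keras.Sequential([
    keras.Input(shape=(state_dim,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(num_actions, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=0.01))

# One toy episode: states, sampled actions, discounted/normalized returns.
states = np.random.randn(3, state_dim).astype("float32")
actions = np.array([0, 1, 0])
returns = np.array([0.5, -0.2, 1.1], dtype="float32")

one_hot = np.eye(num_actions, dtype="float32")[actions]
# sample_weight scales each step's crossentropy by its return,
# giving loss = -return * log pi(action | state) per time-step.
model.train_on_batch(states, one_hot, sample_weight=returns)

A negative (normalized) return simply flips the sign of that step's loss, pushing the sampled action's probability down, which is the intended behavior.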

https://github.com/gabrielgarza/openai-gym-policy-gradient/blob/master/policy_gradient_layers.py

LinkToPast1990 · Mar 14 '19