Question on Chapter 18 - loss functions
Greetings, I'm working through the cartpole example on page 695 of the third edition, and I have a question about the code presented:
def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, truncated, info = env.step(int(action))
    return obs, reward, done, truncated, grads
I'm confused about y_target, and why it's an input to the loss function. If the action is False (0), y_target is 1. If the action is True (1), y_target is 0. It seems like we are effectively saying that the model should have been more confident in whatever its output was. Is that the correct way to think about what y_target is accomplishing? If so, is there something happening in a later step where we determine whether the action recommended by the model was actually beneficial?
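To make sure I'm picturing it right, here's a tiny standalone sketch of the two cases (my own toy numbers, and I'm assuming loss_fn is the binary cross-entropy used in this example):

    import tensorflow as tf

    loss_fn = tf.keras.losses.binary_crossentropy
    left_proba = tf.constant([[0.7]])   # the model says: 70% chance of choosing "left"

    # Action sampled as "left" (False -> 0), so y_target = 1:
    # the loss is small, and its gradient pushes left_proba toward 1 (more confident in "left").
    print(loss_fn([[1.]], left_proba).numpy())   # ~0.36

    # Action sampled as "right" (True -> 1), so y_target = 0:
    # the loss pushes left_proba toward 0 (more confident in "right").
    print(loss_fn([[0.]], left_proba).numpy())   # ~1.20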
I have similar questions about the loss function presented on page 710, but if I can get some clarification on this earlier example, perhaps I'll understand the more challenging Q-value example.
Thank you!
Hi @jab2727 ,
Thanks for your question. You are correct: we are indeed pretending that whatever action the model chose was the correct one, and we're saving the corresponding gradients. Later in the notebook, we determine whether the action was actually good or not, and based on that info we follow the gradient vector in one direction or the other.
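Roughly, the later cells do something like this (a simplified sketch from memory, not the exact notebook code; all_rewards and all_grads are assumed to be the per-step rewards and saved gradients collected by repeatedly calling play_one_step(), and 0.95 is the discount factor):

    import numpy as np
    import tensorflow as tf

    def discount_rewards(rewards, discount_factor):
        # Work backwards so each step's return includes the discounted future rewards.
        discounted = np.array(rewards, dtype=np.float64)
        for step in range(len(rewards) - 2, -1, -1):
            discounted[step] += discounted[step + 1] * discount_factor
        return discounted

    # Compute each step's discounted return, then normalize across all episodes:
    # better-than-average actions get a positive weight, worse-than-average a negative one.
    all_final_rewards = [discount_rewards(rewards, 0.95) for rewards in all_rewards]
    flat = np.concatenate(all_final_rewards)
    all_final_rewards = [(r - flat.mean()) / flat.std() for r in all_final_rewards]

    # Weight each saved gradient by the normalized return of its action, average, and apply.
    # A negative weight flips the gradient, pushing the model away from actions that turned out badly.
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
             for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)
    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))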
Hope this helps!
Ok, thanks so much for the quick response, very helpful. On page 710 we have the following DQN loss function:
def training_step(batch_size):
    experiences = sample_experiences(batch_size)
    states, actions, rewards, next_states, dones, truncateds = experiences
    next_Q_values = model.predict(next_states, verbose=0)
    max_next_Q_values = next_Q_values.max(axis=1)
    runs = 1.0 - (dones | truncateds)
    target_Q_values = rewards + runs * discount_factor * max_next_Q_values
    target_Q_values = target_Q_values.reshape(-1, 1)
    print("The target_Q_values is: ")
    print(target_Q_values)
    mask = tf.one_hot(actions, n_outputs)
    with tf.GradientTape() as tape:
        all_Q_values = model(states)
        Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
        print("The actual Q values are: ")
        print(Q_values)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Please correct me if I'm wrong, but this example is calculating the loss in a very different way. We're not assuming the action was correct and then determining how many points were earned in the discounting step. We're estimating how many points can be earned in the future with target_Q_values, comparing that to what was actually earned, and feeding those two values into the loss function.
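As a sanity check on my reading of the target formula, here's a one-transition toy calculation (made-up numbers):

    reward = 1.0
    done, truncated = False, False
    discount_factor = 0.95
    max_next_Q_value = 1.2           # the model's highest predicted Q-value for the next state

    run = 1.0 - (done | truncated)   # 1.0 while the episode is still running
    target_Q_value = reward + run * discount_factor * max_next_Q_value
    print(target_Q_value)            # 1.0 + 1.0 * 0.95 * 1.2, about 2.14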
If that's correct, I'm reading through the book's explanation of what's happening in the code, but I'm having trouble understanding what's going on from the mask down. The mask appears to zero out some of the Q-values, but I'm not clear on how it knows which are the "ones we do not want". Also, instead of computing the Q-value for every state, would it be possible to compute only the Q-value for the single state that produced the max_next_Q_values?
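For reference, this is what the mask part seems to do when I try it on toy numbers (my own example, assuming n_outputs = 2):

    import tensorflow as tf

    n_outputs = 2
    actions = tf.constant([1, 0, 1])         # actions taken in three sampled experiences
    all_Q_values = tf.constant([[1.0, 2.0],  # the model's Q-values for both possible actions
                                [3.0, 4.0],
                                [5.0, 6.0]])

    mask = tf.one_hot(actions, n_outputs)    # [[0., 1.], [1., 0.], [0., 1.]]
    Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
    print(Q_values.numpy())                  # [[2.], [3.], [6.]] -> only the actions that were taken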
Thank you again!