RLSeq2Seq
A question about the Q-value update
Hello! I can't follow this part (lines 389-407 in run_summarization.py): why does `dqn_best_action` come from the current state rather than from `state_prime`? Also, in model.py the loss is `dist_q_val = -tf.log(dist) * q_value`, which to me reads as pushing `dist` and `q_value` to agree with each other, right? Shouldn't we instead minimize ||Q - q||^2, as in Eq. 29 of https://arxiv.org/pdf/1805.09461.pdf? (See the sketch after the snippet for what I mean.)
# line 389
q_estimates = dqn_results['estimates'] # shape (len(transitions), vocab_size)
dqn_best_action = dqn_results['best_action'] # argmax action, but computed from the current state batch, not b_prime
#dqn_q_estimate_loss = dqn_results['loss']
# use target DQN to estimate values for the next decoder state
dqn_target_results = self.dqn_target.run_test_steps(self.dqn_sess, x=b_prime._x)
q_vals_new_t = dqn_target_results['estimates'] # shape (len(transitions), vocab_size)
# ... (intervening lines omitted) ...
# line 407, inside the loop over transitions (i, tr)
q_estimates[i][tr.action] = tr.reward + FLAGS.gamma * q_vals_new_t[i][dqn_best_action[i]]
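To make the question concrete, here is a minimal sketch of the update I would have expected: the best action is chosen by the online DQN on the next state (Double-DQN style), and the network is then regressed onto the bootstrapped target with a squared error as in Eq. 29. Everything here (`build_q_targets`, `q_online_next`, `q_target_next`, the `tr.done` branch) is my own illustration, not code from the repo:

import numpy as np

def build_q_targets(q_estimates, q_online_next, q_target_next, transitions, gamma):
    """Hypothetical sketch, not repo code.

    q_estimates[i]   : online DQN estimates Q(s_i, .)   -- vector used as the regression target
    q_online_next[i] : online DQN estimates Q(s'_i, .)  -- i.e. run_test_steps fed with b_prime._x
    q_target_next[i] : target DQN estimates Q_target(s'_i, .)
    """
    q_targets = q_estimates.copy()
    for i, tr in enumerate(transitions):
        if tr.done:
            # terminal transition: no bootstrapping
            q_targets[i][tr.action] = tr.reward
        else:
            # Double-DQN: select argmax a' with the online net on s',
            # then evaluate that action with the target net
            best_a = int(np.argmax(q_online_next[i]))
            q_targets[i][tr.action] = tr.reward + gamma * q_target_next[i][best_a]
    return q_targets

# The DQN would then be trained with a squared error against these targets,
#   loss = mean((Q_pred[i, tr.action] - q_targets[i, tr.action]) ** 2)
# rather than the -tf.log(dist) * q_value term currently in model.py.

If selecting the argmax from the current state is intentional here, an explanation of that design choice would clear things up for me.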