RLSeq2Seq
A question about the Q-value update
Hello! I can't follow this part (lines 389-407 in run_summarization.py): why does `dqn_best_action` come from the current state rather than from `state_prime`? Also, in model.py the loss is `dist_q_val = -tf.log(dist) * q_value`, which to me reads as pushing `dist` and `q_value` to agree with each other, right? Shouldn't we instead minimize ||Q - q||^2, as in Eq. 29 of https://arxiv.org/pdf/1805.09461.pdf? (See the sketch after the snippet for what I mean.)
# line 389
q_estimates = dqn_results['estimates'] # shape (len(transitions), vocab_size)
dqn_best_action = dqn_results['best_action'] # argmax action, but computed from the current state batch, not b_prime
#dqn_q_estimate_loss = dqn_results['loss']
# use target DQN to estimate values for the next decoder state
dqn_target_results = self.dqn_target.run_test_steps(self.dqn_sess, x=b_prime._x)
q_vals_new_t = dqn_target_results['estimates'] # shape (len(transitions), vocab_size)
# ... (intervening lines omitted) ...
# line 407, inside the loop over transitions (i, tr)
q_estimates[i][tr.action] = tr.reward + FLAGS.gamma * q_vals_new_t[i][dqn_best_action[i]]
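To make the question concrete, here is a minimal sketch of the update I would have expected: the best action is chosen by the online DQN on the next state (Double-DQN style), and the network is then regressed onto the bootstrapped target with a squared error as in Eq. 29. Everything here (`build_q_targets`, `q_online_next`, `q_target_next`, the `tr.done` branch) is my own illustration, not code from the repo:

import numpy as np

def build_q_targets(q_estimates, q_online_next, q_target_next, transitions, gamma):
    """Hypothetical sketch, not repo code.

    q_estimates[i]   : online DQN estimates Q(s_i, .)   -- vector used as the regression target
    q_online_next[i] : online DQN estimates Q(s'_i, .)  -- i.e. run_test_steps fed with b_prime._x
    q_target_next[i] : target DQN estimates Q_target(s'_i, .)
    """
    q_targets = q_estimates.copy()
    for i, tr in enumerate(transitions):
        if tr.done:
            # terminal transition: no bootstrapping
            q_targets[i][tr.action] = tr.reward
        else:
            # Double-DQN: select argmax a' with the online net on s',
            # then evaluate that action with the target net
            best_a = int(np.argmax(q_online_next[i]))
            q_targets[i][tr.action] = tr.reward + gamma * q_target_next[i][best_a]
    return q_targets

# The DQN would then be trained with a squared error against these targets,
#   loss = mean((Q_pred[i, tr.action] - q_targets[i, tr.action]) ** 2)
# rather than the -tf.log(dist) * q_value term currently in model.py.

If selecting the argmax from the current state is intentional here, an explanation of that design choice would clear things up for me.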