Q-Optimality-Tightening
Question on upper bound
@ShibiHe, first of all, thanks for this inspiring paper and implementation. Great work!
In the paper, you use index substitution to derive the upper bound for Q, which makes perfect sense mathematically.
However, in the implementation, the upper bound is used the same way as the lower bound: as a constant, with no dependency (and thus no gradient) w.r.t. the parameters.
This means, for example, that at time step t in the trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1].
So essentially, what happened before time step t has an impact on the value of Q[t].
Doesn't that conflict with the definition of discounted future reward and with the Markov assumption of an MDP?
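To make concrete what I mean, here is a minimal numpy sketch of the two bounds as I understand them; the function name, arguments, and window handling are illustrative, not the repo's actual Theano code, and I use the target-network value along the trajectory for both bounds as a simplification:

```python
import numpy as np

def bounds_for_step(rewards, q_target_values, t, k, gamma=0.99):
    """Return (lower_bound, upper_bound) for Q[t] using a window of size k.

    rewards[i]         : reward r[i] observed at step i of the sampled trajectory
    q_target_values[i] : target-network estimate of Q(s[i], a[i]) (a constant here)
    Requires t - k - 1 >= 0 and t + k + 1 < len(q_target_values).
    Both bounds are built purely from stored rewards and frozen target values,
    so neither carries a gradient w.r.t. the online network's parameters.
    """
    discounts = gamma ** np.arange(k + 1)

    # Lower bound from the next k+1 rewards plus the bootstrapped target value:
    #   Q[t] >= sum_{i=0..k} gamma^i * r[t+i] + gamma^(k+1) * Q_target[t+k+1]
    lower = np.dot(discounts, rewards[t:t + k + 1]) \
            + gamma ** (k + 1) * q_target_values[t + k + 1]

    # Upper bound by index substitution: write the same inequality at t-k-1
    # and solve for Q[t]:
    #   Q[t] <= gamma^-(k+1) * ( Q_target[t-k-1] - sum_{i=0..k} gamma^i * r[t-k-1+i] )
    upper = (q_target_values[t - k - 1]
             - np.dot(discounts, rewards[t - k - 1:t])) / gamma ** (k + 1)
    return lower, upper
```

With low r[t-2] and r[t-1], the upper bound returned here shrinks, which is exactly the past-dependent pressure on Q[t] I am asking about.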
Please correct me if anything is wrong.
Thanks!
Good question. Theoretically, we should only use the upper bounds after the Q-function is sufficiently trained, and we find that the upper bounds stabilize the training. In practice, we just use the upper bounds from the beginning for simplicity.
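For reference, a minimal sketch of how bounds like these can enter the objective as soft penalties, assuming the quadratic-penalty form described in the paper; the `penalty_weight` value and exact hinge shape here are illustrative choices, and the bounds are again treated as constants so no gradient flows through them:

```python
import numpy as np

def penalized_loss(q_value, td_target, lower_bounds, upper_bounds, penalty_weight=4.0):
    """Squared TD error plus quadratic penalties for violating the tightest bounds."""
    l_max = np.max(lower_bounds)   # tightest (largest) lower bound over the window
    u_min = np.min(upper_bounds)   # tightest (smallest) upper bound over the window

    td_loss = (q_value - td_target) ** 2
    lower_violation = max(l_max - q_value, 0.0) ** 2   # Q fell below a lower bound
    upper_violation = max(q_value - u_min, 0.0) ** 2   # Q exceeded an upper bound
    return td_loss + penalty_weight * (lower_violation + upper_violation)
```

Using the upper-bound term from the start simply means the second hinge is active throughout training rather than being switched on after a warm-up phase.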
Thanks for the reply! It works indeed.