Q-Optimality-Tightening
Question on upper bound
@ShibiHe, first of all, thanks for this inspiring paper and implementation. Great work!
In the paper, you use index substitution to derive the upper bound for Q, which makes perfect sense mathematically.
However, in the implementation, the upper bound is used the same way as the lower bound: as a constant, with no dependency (and thus no gradient) w.r.t. the parameters.
This means, for example, that at time step t in the trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1].
So essentially, what happened before time step t has an impact on the value of Q[t].
Doesn't that conflict with the definition of discounted future reward and with the Markov assumption of an MDP?
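To make concrete what I mean, here is a minimal numpy sketch of the two bounds as I understand them; the function name, arguments, and window handling are illustrative, not the repo's actual Theano code, and I use the target-network value along the trajectory for both bounds as a simplification:

```python
import numpy as np

def bounds_for_step(rewards, q_target_values, t, k, gamma=0.99):
    """Return (lower_bound, upper_bound) for Q[t] using a window of size k.

    rewards[i]         : reward r[i] observed at step i of the sampled trajectory
    q_target_values[i] : target-network estimate of Q(s[i], a[i]) (a constant here)
    Requires t - k - 1 >= 0 and t + k + 1 < len(q_target_values).
    Both bounds are built purely from stored rewards and frozen target values,
    so neither carries a gradient w.r.t. the online network's parameters.
    """
    discounts = gamma ** np.arange(k + 1)

    # Lower bound from the next k+1 rewards plus the bootstrapped target value:
    #   Q[t] >= sum_{i=0..k} gamma^i * r[t+i] + gamma^(k+1) * Q_target[t+k+1]
    lower = np.dot(discounts, rewards[t:t + k + 1]) \
            + gamma ** (k + 1) * q_target_values[t + k + 1]

    # Upper bound by index substitution: write the same inequality at t-k-1
    # and solve for Q[t]:
    #   Q[t] <= gamma^-(k+1) * ( Q_target[t-k-1] - sum_{i=0..k} gamma^i * r[t-k-1+i] )
    upper = (q_target_values[t - k - 1]
             - np.dot(discounts, rewards[t - k - 1:t])) / gamma ** (k + 1)
    return lower, upper
```

With low r[t-2] and r[t-1], the upper bound returned here shrinks, which is exactly the past-dependent pressure on Q[t] I am asking about.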
Please correct me if anything is wrong.
Thanks!
Good question. Theoretically, we should only use the upper bounds after the Q-function is sufficiently trained, and we find that the upper bounds stabilize the training. In practice, we just use the upper bounds from the beginning for simplicity.
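For reference, a minimal sketch of how bounds like these can enter the objective as soft penalties, assuming the quadratic-penalty form described in the paper; the `penalty_weight` value and exact hinge shape here are illustrative choices, and the bounds are again treated as constants so no gradient flows through them:

```python
import numpy as np

def penalized_loss(q_value, td_target, lower_bounds, upper_bounds, penalty_weight=4.0):
    """Squared TD error plus quadratic penalties for violating the tightest bounds."""
    l_max = np.max(lower_bounds)   # tightest (largest) lower bound over the window
    u_min = np.min(upper_bounds)   # tightest (smallest) upper bound over the window

    td_loss = (q_value - td_target) ** 2
    lower_violation = max(l_max - q_value, 0.0) ** 2   # Q fell below a lower bound
    upper_violation = max(q_value - u_min, 0.0) ** 2   # Q exceeded an upper bound
    return td_loss + penalty_weight * (lower_violation + upper_violation)
```

Using the upper-bound term from the start simply means the second hinge is active throughout training rather than being switched on after a warm-up phase.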
Thanks for the reply! It works indeed.