Q-Optimality-Tightening

Question on upper bound

Open immars opened this issue 8 years ago • 2 comments

@ShibiHe , First of all, thanks for this inspiring paper and implementation, great work!

In the paper, you use index substitution to derive the upper bound for Q, which makes perfect sense mathematically.

However, in the implementation, the upper bound is used the same way as the lower bound: as a constant target, with no dependency (and thus no gradient) w.r.t. the parameters.

This means, for example, that at time step t, in a trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1].

So essentially, what happened before time step t has an impact on the value of Q[t].

Doesn't that conflict with the definition of discounted future reward, and with the Markov assumption of the MDP?
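For reference, here is a minimal sketch (plain Python, not the repo's Theano code) of how I read the bounds for a single transition; `rewards`, `q_target_sa`, `q_target_max`, `t`, and `K` are placeholder names I'm assuming for illustration:

```python
GAMMA = 0.99  # discount factor (assumed value, for illustration only)

def bounds_for_step(rewards, q_target_sa, q_target_max, t, K):
    """Illustrative sketch of the bounds for the transition at time t.
      rewards[i]      -- r_i along the sampled trajectory
      q_target_sa[i]  -- Q_target(s_i, a_i), from the target network, held constant
      q_target_max[i] -- max_a Q_target(s_i, a), also held constant
    Returns (L_max, U_min): the tightest lower and upper bounds on Q(s_t, a_t).
    """
    lower, upper = [], []
    for k in range(K):
        # Lower bound from future rewards:
        #   L_{t,k} = sum_{i=0..k} gamma^i * r_{t+i} + gamma^(k+1) * max_a Q_target(s_{t+k+1}, a)
        future_ret = sum(GAMMA ** i * rewards[t + i] for i in range(k + 1))
        lower.append(future_ret + GAMMA ** (k + 1) * q_target_max[t + k + 1])

        # Upper bound from past rewards (the index substitution mentioned above):
        #   U_{t,k} = gamma^(-(k+1)) * (Q_target(s_{t-k-1}, a_{t-k-1}) - sum_{i=0..k} gamma^i * r_{t-k-1+i})
        past_ret = sum(GAMMA ** i * rewards[t - k - 1 + i] for i in range(k + 1))
        upper.append(GAMMA ** (-(k + 1)) * (q_target_sa[t - k - 1] - past_ret))
    return max(lower), min(upper)
```

In this reading, both bounds are built entirely from rewards and target-network values, so they enter the loss as constants; only Q(s_t, a_t) itself receives a gradient, which is exactly the behaviour I'm asking about.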

Please correct me if I've got anything wrong.

Thanks!

immars avatar Jun 16 '17 07:06 immars

Good question. Theoretically, we should only use the upper bounds after Q has been sufficiently trained, and we find that the upper bounds stabilize training. In practice, we just use the upper bounds from the beginning for simplicity.
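To make that concrete, a minimal sketch of how both bounds could be folded into the per-transition objective from the very first update (the penalty weight and the argument names are assumptions here, not the repo's exact code):

```python
def penalty_loss(q_sa, td_target, l_max, u_min, lam=4.0):
    """Sketch of a penalty-augmented objective for one transition.
      q_sa      -- Q(s_t, a_t) from the online network (the only quantity with a gradient)
      td_target -- r_t + gamma * max_a Q_target(s_{t+1}, a), held constant
      l_max     -- tightest lower bound from future rewards, held constant
      u_min     -- tightest upper bound from past rewards, held constant
      lam       -- penalty weight (assumed value)
    """
    td_err = (td_target - q_sa) ** 2
    below_lower = max(l_max - q_sa, 0.0) ** 2   # penalize Q falling below its lower bound
    above_upper = max(q_sa - u_min, 0.0) ** 2   # penalize Q exceeding its upper bound
    return td_err + lam * (below_lower + above_upper)
```

Here the bounds appear only inside penalty terms, which matches the "use them from the beginning" choice described above.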

ShibiHe avatar Jun 23 '17 13:06 ShibiHe

Thanks for the reply! It works indeed.

immars avatar Jul 24 '17 09:07 immars