Wrong recurrent state accessed in R2D2 Learner
In the R2D2 learner, learning trajectories are sampled from Reverb in a format where, at index t, you have the observation x_t, action a_t, reward r_t, and the recurrent state state_t that the LSTM network had *after* processing observation x_t. Doesn't this mean that when the learner applies unroll (link to code), it seeds the network with an LSTM state from one step in the future, effectively leaking information from that future state? A minimal sketch of the alignment issue is below.
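The following is a hedged sketch, not Acme's actual implementation: the helpers `simple_core` and `unroll`, and the storage convention shown, are assumptions made only to illustrate the alignment question. It contrasts seeding the unroll with the post-observation state (the stored state_t) versus the pre-observation state (state_{t-1}).

```python
# Sketch only -- toy stand-ins for the LSTM core and the learner's unroll.
import numpy as np

def simple_core(x, state):
    """Stand-in for an LSTM cell: output and next state depend on x and state."""
    new_state = 0.5 * state + x          # toy recurrence
    return new_state, new_state

def unroll(core, initial_state, observations):
    """Unrolls `core` over `observations`, starting from `initial_state`."""
    state = initial_state
    outputs = []
    for x in observations:
        out, state = core(x, state)
        outputs.append(out)
    return np.stack(outputs), state

# Acting phase (as described above): at index t the actor stores the state
# the network had *after* processing x_t.
observations = np.arange(1.0, 6.0)        # x_0 .. x_4
states, state = [], 0.0                   # state before x_0
for x in observations:
    _, state = simple_core(x, state)
    states.append(state)                  # states[t] = state after x_t

# Learning phase: if the sampled sequence starts at t=2 and the unroll is
# seeded with states[2] (the post-x_2 state), then x_2 is processed with a
# state that already incorporates x_2 itself -- the "future" state the
# question refers to.  Seeding with states[1] (the state *before* x_2 was
# processed) reproduces the online computation exactly.
start = 2
outputs_post, _ = unroll(simple_core, states[start], observations[start:])
outputs_pre, _ = unroll(simple_core, states[start - 1], observations[start:])
print("seeded with post-observation state:", outputs_post)
print("seeded with pre-observation state :", outputs_pre)
```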