[Question] Why is env_change[batch_inds] not considered during _get_samples() in RecurrentRolloutBuffer?
❓ Question
When sampling batches from a filled RecurrentRolloutBuffer, only episode_starts[batch_inds] is returned with the sequence data, and this "episode_starts" signal is what the LSTM policy uses to reset its hidden state during training. However, I have a question about the behavior here: since the seq_start_indices are determined by both episode_starts and env_change, why is only episode_starts returned? To be more clear, why is line 240 in common.recurrent.buffers written as `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds])` instead of `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds] or env_change[batch_inds])`?
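To illustrate what I mean (a toy sketch with made-up flag arrays; note that a plain `or` would not work element-wise on NumPy arrays, so I really mean something like `np.logical_or`):

```python
import numpy as np

# Toy flags for 6 sampled transitions
episode_starts = np.array([1, 0, 0, 0, 1, 0], dtype=np.float32)
env_change = np.array([0, 0, 1, 0, 0, 0], dtype=np.float32)

# What the buffer currently returns to the policy (true episode boundaries only):
print(episode_starts)                                   # [1. 0. 0. 0. 1. 0.]

# What I would have expected, since both signals define seq_start_indices:
print(np.logical_or(episode_starts, env_change) * 1.0)  # [1. 0. 1. 0. 1. 0.]
```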
Thank you for the explanation in advance.
Checklist
- [x] I have checked that there is no similar issue in the repo
- [x] I have read the documentation
- [x] If code there is, it is minimal and working
- [ ] If code there is, it is formatted using the markdown code blocks for both code and stack traces.
Hello, thanks for pointing that out. Might be a bug. I need to dig deeper, this code is overly complex even with all the comments 🙈
To refresh my memory, I've created some graphics (I should probably put them in the SB3 docs later).
First, we collect n_steps * n_envs transitions and then flatten them to be able to sample from them:
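Roughly like this (a minimal sketch of the swap-and-flatten step, using a tiny made-up buffer):

```python
import numpy as np

n_steps, n_envs, obs_dim = 4, 2, 3
obs = np.arange(n_steps * n_envs * obs_dim, dtype=np.float32).reshape(n_steps, n_envs, obs_dim)

# Swap the step and env axes so transitions from the same env stay contiguous,
# then merge them into a single batch axis of size n_steps * n_envs.
flat_obs = obs.swapaxes(0, 1).reshape(n_steps * n_envs, obs_dim)
print(flat_obs.shape)  # (8, 3)
```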
Optionally, we randomly split the flattened sequences to add a bit of randomness when sampling over multiple epochs:
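Something along these lines (a simplified sketch of the rotation trick, not the exact buffer code):

```python
import numpy as np

n_transitions = 8  # n_steps * n_envs
indices = np.arange(n_transitions)

# Pick a random split point and rotate the flattened indices around it,
# so successive epochs do not always cut sequences at the same positions.
split_index = np.random.randint(n_transitions)
indices = np.concatenate((indices[split_index:], indices[:split_index]))
print(indices)  # e.g. [5 6 7 0 1 2 3 4]
```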
We need to zero out the hidden state of the LSTM whenever a new sequence starts (i.e., on an episode start or an env change event).
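In code, the sequence boundaries are computed from both signals, roughly like this (a sketch of what `create_sequencers` does, with made-up flags):

```python
import numpy as np

episode_starts = np.array([0, 0, 1, 0, 0, 0], dtype=bool)
env_change = np.array([0, 0, 0, 0, 1, 0], dtype=bool)

# A new sequence starts on an episode boundary OR when the flattened data
# switches to another env; the very first transition always starts one.
seq_start = np.logical_or(episode_starts, env_change)
seq_start[0] = True
seq_start_indices = np.where(seq_start)[0]
print(seq_start_indices)  # [0 2 4]
```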
Then we take batch_size transitions and reshape them by sequence to be able to pad them:
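For example (a toy sketch using `pad_sequence`, with made-up start indices):

```python
import torch as th
from torch.nn.utils.rnn import pad_sequence

flat_data = th.arange(7, dtype=th.float32)  # a mini-batch of 7 transitions
seq_start_indices = [0, 2, 3]               # assumed sequence boundaries

# Cut the flat mini-batch into sequences at the start indices...
bounds = seq_start_indices + [len(flat_data)]
seqs = [flat_data[start:end] for start, end in zip(bounds[:-1], bounds[1:])]

# ...then zero-pad the shorter ones up to the longest sequence length.
padded = pad_sequence(seqs, batch_first=True)
print(padded)
# tensor([[0., 1., 0., 0.],
#         [2., 0., 0., 0.],
#         [3., 4., 5., 6.]])
```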
Note: we use the flattened sequences to compute things like the advantage, but the last operation will be reverted before the mini-batch is fed to the LSTM (which expects `(max_length, n_seq, features_dim)` as input).
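Concretely, the revert looks roughly like this (a sketch, assuming the time-major `(max_length, n_seq, features_dim)` layout that `nn.LSTM` expects by default):

```python
import torch as th

n_seq, max_length, features_dim = 3, 4, 8
flat_features = th.randn(n_seq * max_length, features_dim)  # flattened mini-batch

# Undo the flattening: group features by sequence again, then move the time
# axis first so the tensor matches the LSTM's expected input layout.
features_seq = flat_features.reshape(n_seq, max_length, features_dim).swapaxes(0, 1)
print(features_seq.shape)  # torch.Size([4, 3, 8])
```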
> To be more clear, why is line 240 in common.recurrent.buffers written as `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds])` instead of `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds] or env_change[batch_inds])`?
After looking at it, it should probably be something like: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/00a401db2c0bcfe8410fba2c4df1d001909e59e3/sb3_contrib/common/recurrent/buffers.py#L82
We should also double-check the effect of the split on it...
This looks like a feature, not a bug. We will cut multiple sub-sequences, but only a few of them will be actual episode starts. For the sub-sequences that are actual starts, we will later reset the hidden state (on their first sequence step). For those that are not actual starts, we will use the stored hidden states instead of resetting them on the first sequence step.
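That is also how it plays out on the policy side: the stored state is only zeroed where `episode_start` is 1, roughly like this (a simplified sketch of the masking, not the exact policy code):

```python
import torch as th

n_seq, hidden_dim = 3, 8
lstm_states = (th.randn(1, n_seq, hidden_dim), th.randn(1, n_seq, hidden_dim))
episode_start = th.tensor([1.0, 0.0, 0.0])  # per-sequence flag at this step

# Multiply by (1 - episode_start): sequences that begin a new episode get
# a zeroed state, all other sequences keep their stored hidden states.
mask = (1.0 - episode_start).view(1, n_seq, 1)
lstm_states = (mask * lstm_states[0], mask * lstm_states[1])
```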
FYI, I created a fix a few weeks ago: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/290
However, I didn't have much time to benchmark it (so far, I couldn't see any significant changes), so if someone has time to benchmark it (using the no-velocity envs from the RL Zoo, see the SB3-Contrib doc), feel free to report the results here =)
> We need to zero out the hidden state of the LSTM whenever a new sequence starts (i.e., on an episode start or an env change event).
Do we need to reset the hidden state when the env changes? AFAIK we are not resetting the env after collecting the rollouts, so s0 in episode 3 can be a state in the middle of a rollout, right?
On a side note: if we are creating a new sequence for every episode start, do we even need these lines? Can we further optimise them, given that there won't be any case where `episode_starts[1:]` is 1 during training?