stable-baselines3-contrib

[Question] Why is env_change[batch_inds] not considered during _get_samples() in RecurrentRolloutBuffer?

Open · Hhannzch opened this issue 9 months ago · 6 comments

❓ Question

When sampling batches from a filled RecurrentRolloutBuffer, only `episode_starts[batch_inds]` is returned with the sequence data, and this `episode_starts` signal is what the LSTM policy uses to reset its hidden state during training. However, I have a question about the behaviour here: since the `seq_start_indices` are determined by both `episode_starts` and `env_change`, why is only `episode_starts` returned? To be more precise, why is line 240 in `common/recurrent/buffers.py` written as `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds])` instead of something like `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds] or env_change[batch_inds])`?
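To make the mismatch concrete, here is a small self-contained numpy sketch (toy data, not the library code): the sequence cutting sees a start at the env boundary, but the flag returned to the policy does not.

```python
import numpy as np

# Toy mini-batch of 6 flattened transitions: an env boundary (env_change=1)
# occurs at index 3 without an episode start there.
episode_starts = np.array([1, 0, 0, 0, 1, 0], dtype=np.float32)
env_change = np.array([0, 0, 0, 1, 0, 0], dtype=np.float32)

# Sequence starts used to cut/pad the batch are based on BOTH signals:
seq_start = np.logical_or(episode_starts, env_change)
print(np.flatnonzero(seq_start))       # [0 3 4] -> three sequences

# But only episode_starts is handed back to the policy:
print(np.flatnonzero(episode_starts))  # [0 4] -> the start at index 3 is not flagged
```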

Thank you for the explanation in advance.


Hhannzch · Mar 11 '25

Hello, thanks for pointing that out. Might be a bug. I need to dig deeper, this code is overly complex even with all the comments 🙈

araffin · Mar 13 '25

To refresh my memory, I've created some graphics (I should probably put them in the SB3 docs later).

First, we collect n_steps * n_envs transitions and then flatten them to be able to sample from them:

[image: rollout of shape (n_steps, n_envs) flattened into a single sequence of transitions]
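A minimal numpy sketch of that flattening step (illustrative shapes and variable names, similar in spirit to SB3's `swap_and_flatten` helper):

```python
import numpy as np

n_steps, n_envs, obs_dim = 4, 3, 2  # illustrative sizes
rollout = np.arange(n_steps * n_envs * obs_dim).reshape(n_steps, n_envs, obs_dim)

# Put the env axis first so each env's steps stay contiguous, then flatten:
# (n_steps, n_envs, obs_dim) -> (n_envs, n_steps, obs_dim) -> (n_steps * n_envs, obs_dim)
flattened = rollout.swapaxes(0, 1).reshape(n_steps * n_envs, obs_dim)
print(flattened.shape)  # (12, 2)
```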

Optionally, we randomly split the flattened sequences to add a bit of randomness when sampling over multiple epochs:

[image: flattened transitions split at a random index before sampling]
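Roughly, that split amounts to rotating the flattened indices at a random point (paraphrased; the exact implementation in the buffer may differ):

```python
import numpy as np

n_transitions = 12  # n_steps * n_envs from the example above
rng = np.random.default_rng()

# Keep the original transition order but rotate it at a random split point,
# so successive epochs don't always produce identical mini-batches:
split_index = rng.integers(n_transitions)
indices = np.arange(n_transitions)
indices = np.concatenate((indices[split_index:], indices[:split_index]))
print(indices)  # e.g. [7 8 9 10 11 0 1 2 3 4 5 6] when split_index == 7
```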

We need to zero out the hidden state of the LSTM whenever a new sequence starts (i.e., on an episode start or an env change event).

Then we take batch_size transitions and reshape them by sequence so they can be padded:

[image: batch_size transitions cut into sequences and padded to a common length]
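A rough sketch of that cut-and-pad step (toy data; the real `pad` / `pad_and_flatten` helpers are more general and work on tensors):

```python
import numpy as np

# Toy mini-batch: 6 flattened transitions and their sequence-start flags
# (episode start or env change), giving one sequence of length 4 and one of length 2.
data = np.arange(6, dtype=np.float32)
seq_start = np.array([1, 0, 0, 0, 1, 0], dtype=bool)

# Cut the batch into sequences at the start flags...
seq_start_indices = np.flatnonzero(seq_start)
sequences = np.split(data, seq_start_indices[1:])

# ...and zero-pad them to a common length so they stack as (n_seq, max_length):
max_length = max(len(seq) for seq in sequences)
padded = np.stack([np.pad(seq, (0, max_length - len(seq))) for seq in sequences])
print(padded.shape)  # (2, 4)
```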

Note: we use the flattened sequence to compute things like the advantage, but the last operation will be reverted before the mini-batch is fed to the LSTM (which expects (max_length, n_seq, features_dim) as input).

araffin · Mar 31 '25

> To be more precise, why is line 240 in `common/recurrent/buffers.py` written as `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds])` instead of something like `episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds] or env_change[batch_inds])`?

After looking at it, it should probably be like: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/00a401db2c0bcfe8410fba2c4df1d001909e59e3/sb3_contrib/common/recurrent/buffers.py#L82

We should also double-check the effect of the random split on it...

araffin · Mar 31 '25

This looks like a feature, not a bug. We will cut multiple sub-sequences, but only a few of them will be actual episode starts. For those that are actual starts, we will later reset the hidden state (on the first sequence step). For those that aren't, we will use the stored hidden states instead of resetting them on the first sequence step.
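A minimal PyTorch sketch of that reset-or-carry behaviour for a single time step (paraphrased from how the recurrent policy masks its states; names and shapes here are illustrative, not the exact policy code):

```python
import torch as th

n_seq, features_dim, hidden_dim = 2, 4, 8
lstm = th.nn.LSTM(features_dim, hidden_dim)

features = th.randn(1, n_seq, features_dim)      # one time step for n_seq sequences
lstm_states = (th.randn(1, n_seq, hidden_dim),   # hidden/cell states stored during rollout
               th.randn(1, n_seq, hidden_dim))
episode_start = th.tensor([1.0, 0.0])            # seq 0 starts an episode, seq 1 does not

# Zero the states only where an episode actually starts;
# otherwise keep (carry over) the stored states:
mask = (1.0 - episode_start).view(1, n_seq, 1)
_, lstm_states = lstm(features, (mask * lstm_states[0], mask * lstm_states[1]))
```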

ar-too · Apr 16 '25

fyi, I created a fix some weeks ago: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/290

but I didn't have much time to benchmark it (so far, I couldn't see any significant changes), so if someone has time to benchmark it (using the no-vel envs from the RL Zoo, see the SB3-Contrib doc), feel free to report the results here =)

araffin · Aug 01 '25

> We need to zero out the hidden state of the LSTM whenever a new sequence starts (i.e., on an episode start or an env change event).

Do we need to reset the hidden state when the env changes? AFAIK we are not resetting the env after collecting the rollouts, so s0 in episode 3 can be a state in the middle of a rollout, right?

On a side note, if we are creating a new sequence for every episode start, do we even need these lines? Can we further optimise them, since there won't be any case where episode_starts[1:] is 1 during training?
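If that claim holds, a quick sanity check could be run on the padded mini-batch, e.g. (toy data, assuming the flags are reshaped to (max_length, n_seq) as the LSTM sees them):

```python
import numpy as np

# Toy episode_starts of shape (max_length, n_seq): if every episode start also
# opens a new (sub-)sequence, only the first time step can ever carry a 1.
episode_starts = np.array([[1.0, 0.0, 1.0],   # step 0: some sequences begin an episode
                           [0.0, 0.0, 0.0],
                           [0.0, 0.0, 0.0]])
assert not episode_starts[1:].any(), "episode start found after the first time step"
```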

suijth · Aug 02 '25