Mava
[BUG] Fix termination vs truncation mixup
Describe the bug
It seems we are not using the correct termination vs truncation values: we always use the condition termination or truncation (`timestep.last()`) when we often want the condition of only termination (`1 - discount`). This is especially tricky in the recurrent systems.
Expected behavior
When calculating advantages we should use termination (`1 - discount`), and in the recurrent systems, when passing inputs to the networks during training, we should use termination or truncation (`timestep.last()`) in order to correctly reset the hidden state.
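As a minimal sketch of the distinction (function and variable names here are hypothetical, not Mava's actual API): advantage estimation must bootstrap through truncations, while the recurrent state must be reset on either kind of episode end.

```python
import numpy as np

def gae_advantages(rewards, values, next_values, terminations,
                   gamma=0.99, lam=0.95):
    """Generalized advantage estimation, bootstrapping through truncations.

    `terminations` (i.e. 1 - discount) must flag only true episode endings:
    a truncated episode is not terminal, so its last step still bootstraps
    from the next-state value estimate.
    """
    not_done = 1.0 - terminations
    advantages = np.zeros_like(rewards)
    gae_acc = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * not_done[t] * next_values[t] - values[t]
        gae_acc = delta + gamma * lam * not_done[t] * gae_acc
        advantages[t] = gae_acc
    return advantages

def maybe_reset_hidden(hidden, last):
    """Reset the RNN hidden state wherever a new episode begins.

    `last` (i.e. timestep.last()) is termination OR truncation: either way
    the next observation starts a fresh episode, so the recurrent state
    must be zeroed for those batch entries.
    """
    return np.where(last[:, None], np.zeros_like(hidden), hidden)
```

Using `timestep.last()` as the mask in `gae_advantages` would wrongly zero the bootstrap value at truncations, biasing the advantage estimates; using `1 - discount` in `maybe_reset_hidden` would fail to reset the hidden state after truncated episodes.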
Possible Solution
Always put `1 - discount` in `PPOTimestep.done` and always put `timestep.last()` in `RNNLearnerState.done`.

To avoid issues like this in the future, I think we should rename `RnnLearnerState.done` to `RnnLearnerState.truncation`.
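A rough sketch of the two carrier types under this proposal (the field names come from this issue; the surrounding structure is hypothetical, not Mava's actual definitions):

```python
from typing import NamedTuple
import numpy as np

class PPOTimestep(NamedTuple):
    """Transition data used for advantage estimation."""
    # Holds 1 - discount: flags true termination only, so GAE can
    # still bootstrap through truncated episodes.
    done: np.ndarray

class RnnLearnerState(NamedTuple):
    """Carry for the recurrent learner."""
    # Holds timestep.last(): termination OR truncation, used only to
    # reset the RNN hidden state at episode boundaries.
    truncation: np.ndarray
```

Giving the two masks distinct names makes it harder to pass the wrong one, since `done` no longer means two different things in two different places.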
It looks like there are a couple of places where we use `PPOTimestep.done` when it should be `RNNLearnerState.done`, so we'd have to go through and make sure we're always using the correct one. An example is here and here, where we're using `PPOTimestep.done` (which would be `1 - discount`) to reset the hidden state; instead we should pass `RnnLearnerState.truncation` to the loss functions and use that.