stable-baselines3
N-step updates for off-policy methods
Changes
OffPolicyAlgorithm
- Now accepts a replay buffer class as an argument
  - Rationale: allows for other types of replay buffers besides the current implementations
- Now accepts keyword arguments for the replay buffer (see the usage sketch after this list)
  - Rationale: makes it easier to pass arguments because we don't need to check every scenario
- I haven't moved `buffer_size` into it
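For example, the new arguments could be used roughly like this (the argument names mirror the zoo command quoted later in the thread; the import path and environment id are illustrative):

```python
from stable_baselines3 import TD3
# assumption: the new buffer lives in stable_baselines3.common.buffers
from stable_baselines3.common.buffers import NStepReplayBuffer

model = TD3(
    "MlpPolicy",
    "Pendulum-v0",
    replay_buffer_class=NStepReplayBuffer,  # new: pass the buffer class directly
    replay_buffer_kwargs=dict(n_step=3),    # new: forwarded to the buffer constructor
)
model.learn(total_timesteps=10_000)
```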
NStepReplayBuffer
- Derives from `ReplayBuffer` and adds two more arguments, `n_step` and `gamma` (a skeleton sketch follows this list).
- Overrides `_get_samples(self, batch_inds, env) -> ReplayBufferSamples` to compute n-step returns.
- The computation is done purely with NumPy, without Python loops, which makes the implementation simpler and faster.
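A minimal skeleton of what such a class could look like (this is an illustration, not the PR's actual code; the base constructor arguments follow the existing `ReplayBuffer`):

```python
from stable_baselines3.common.buffers import ReplayBuffer


class NStepReplayBuffer(ReplayBuffer):  # skeleton only
    def __init__(self, buffer_size, observation_space, action_space,
                 n_step=1, gamma=0.99, **kwargs):
        super().__init__(buffer_size, observation_space, action_space, **kwargs)
        self.n_step = n_step
        self.gamma = gamma

    def _get_samples(self, batch_inds, env=None):
        # compute n-step returns here (see the sketch in the next subsection)
        ...
```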
Computing n-step returns
- Find the correct transition indices.
- Create a 2D mask for the rewards using `done`.
- Filter the mask so that `self.pos` is ignored, because it contains older experience.
- Multiply the filter with the rewards and the discounting, and sum across (a rough sketch of this computation follows below).
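To make these steps concrete, here is a rough, self-contained sketch of the vectorized computation (an illustration under the assumptions stated in the comments, not the PR's exact code):

```python
import numpy as np


def n_step_returns(rewards, dones, batch_inds, pos, buffer_size, n_step, gamma):
    """Rough sketch of the vectorized n-step return described above.

    rewards, dones: 1D arrays of length `buffer_size` (single env assumed).
    batch_inds: 1D array of sampled start indices (assumed to never equal `pos`).
    pos: next write position of the circular buffer.
    """
    steps = np.arange(n_step)
    # (batch_size, n_step) transition indices, wrapping around the circular buffer
    indices = (batch_inds[:, None] + steps[None, :]) % buffer_size

    # Mask that zeroes everything *after* the first `done` in each chain,
    # while keeping the terminal reward itself (hence the shift by one step).
    not_done = 1.0 - dones[indices]
    done_mask = np.hstack([np.ones((len(batch_inds), 1)),
                           np.cumprod(not_done, axis=1)[:, :-1]])

    # Mask that zeroes the chain from `pos` onwards: data at and past `pos`
    # belongs to an older trajectory (or is about to be overwritten).
    pos_mask = np.cumprod(indices != pos, axis=1)

    # Discounted sum across the n steps: (batch_size, n_step) -> (batch_size,)
    discounts = gamma ** steps
    return np.sum(rewards[indices] * done_mask * pos_mask * discounts, axis=1)
```

The real buffer would additionally pick the bootstrap observation at the last valid index and scale the critic target accordingly, which this sketch leaves out.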
Motivation and Context
- [x] I have raised an issue to propose this change (required for new features and bug fixes)
Closes #47
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation (update in the documentation)
Checklist:
- [x] I've read the CONTRIBUTION guide (required)
- [x] I have updated the changelog accordingly (required).
- [x] My change requires a change to the documentation.
- [x] I have updated the tests accordingly (required for a bug fix or a new feature).
- [ ] I have updated the documentation accordingly.
- [x] I have checked the codestyle using `make lint`.
- [x] I have ensured `make pytest` and `make type` both pass.
N-step implementation details
Funny, I also recently gave it a try here: https://github.com/DLR-RM/stable-baselines3/tree/feat/n-steps
The implementation appears mostly the same. I would like to add the class and the arguments as parameters to the OffPolicyAlgorithm along with a variable named `replay_args`, in a similar fashion to how policies are passed. This will make it easy to add other methods as well. IIRC an energy-based replay buffer was also considered at some point.
@araffin
So I was working on this and there's a bug with both your and my implementations.
The bug is due to `% self.buffer_size`.
https://github.com/DLR-RM/stable-baselines3/blob/41eb0cb37e8ac35413c0b11d0d5bd5cfbd2c92e2/stable_baselines3/common/buffers.py#L441-L454
The bug is basically the same as with memory optimization. To be exact, going around the buffer (`% self.buffer_size`) gives no guarantee that the transition is part of the same trajectory.
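To make the failure mode concrete, a toy example with made-up numbers:

```python
import numpy as np

buffer_size, n_step, pos = 5, 3, 2   # `pos` is the next write index
batch_inds = np.array([0, 1])        # the most recently written transitions

indices = (batch_inds[:, None] + np.arange(n_step)) % buffer_size
print(indices)  # [[0 1 2]
                #  [1 2 3]]
# Indices 2 and 3 sit at/after `pos`, so they hold the *oldest* data in the buffer,
# not the successors of transitions 0 and 1 -- naively summing their rewards mixes
# unrelated trajectories. Hence the extra mask on `self.pos` in the n-step computation.
```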
My implementation is WIP and there are a couple of issues that I am working on.
**Edit:**
Completed it: removed the loops, added tests, and it now uses just NumPy arrays. It should be quite fast.
The bug is basically the same as with memory optimization.
:see_no_evil: Yep, I only did a quick test with it and could not see any improvement yet.
Completed it: removed the loops, added tests, and it now uses just NumPy arrays. It should be quite fast.
Sounds good =)
Ping me when you want a review
I merged master and removed the generic from `OffPolicyAlgorithm` since it wasn't discussed. I think it's ready for a review.
Ps. sorry for autoformatting >.< ...
Ps. sorry for autoformatting >.< ...
I will do a PR soon for that ;) Apparently will be with black.
At a quick glance it looks good, like araffin already mentioned, but I would be even more exhaustive with the comments. The numpy tricks are compact and can be performant, but the code becomes harder to understand. I would also include the shape of arrays in the comments along with a definition of the contents, e.g. `# (batch_size, num_actions) of {0,1}`.
I'm currently giving that one a quick try (feat/n-steps in the zoo), and it already yields some interesting results with DQN on CartPole ;)
And I couldn't notice any slow down.
The command:
python train.py --algo dqn --env CartPole-v1 -params replay_buffer_class:NstepReplayBuffer replay_buffer_kwargs:"dict(n_step=4)" --num-threads 2 -tb /tmp/sb3
The first two curves are with `n_step=4` (much more stable learning) and the other two are with `n_step=1`.

I will try with SAC soon.
Looks good! What about the FPS? It should have a very small impact, but there are still some optimizations that can be made.
Looks good! What about the FPS? It should have a very small impact, but there are still some optimizations that can be made.
On DQN with CartPole, as mentioned, I couldn't notice the difference (all runs were around 1000 FPS). For SAC, experiments are currently running but there is no difference for now =)
For SAC, I am not sure if n-steps can be applied directly, as I am under the impression that the backup requires the entropy for the intermediate states as well, not just the last one.
For SAC, I am not sure if n-steps can be applied directly, as I am under the impression that the backup requires the entropy for the intermediate states as well, not just the last one.
Good point, for now I'm just checking quickly, assuming the entropy term is negligible compared to the reward term. I could see a positive impact on LunarLanderContinuous but not on PyBullet envs yet (probably because the reward is quite shaped).
The 3 runs with mean reward > 200 are the ones with the n-step replay buffer:

I added a sketch of how it would look for SAC; it fits in ~10 lines. We would need to allocate one more array, `log_prob`, of size `buffer_size`, and pass the actor to the buffer.
I also did a small pre-processing step to remove duplicated indices when predicting the `log_prob`.
If you have plans for other n-step methods (I'd love to have Retrace), then we can sketch out how those would work (most of them require action probabilities) and figure out how to incorporate SAC there as well. I was thinking of something along the lines of adding an entropy coefficient like so:
# compute rewards, entropy and other weighting fractions onto rewards,
rewards = (rewards + entr_coef * entropy) @ weighting
If we go with something like this, because we will be computing action probabilities anyway, we would have the entropy for (almost) free and it will be easier to add everything together. It will also open a path towards other soft methods.
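To make the shapes in that one-liner concrete, here is a hedged expansion of it (`entr_coef`, the mask handling and the `weighting` vector are assumptions for illustration, not code from this PR):

```python
import numpy as np

batch_size, n_step, gamma, entr_coef = 4, 3, 0.99, 0.2

rewards = np.random.randn(batch_size, n_step)  # (batch_size, n_step) rewards along each chain
entropy = np.random.randn(batch_size, n_step)  # (batch_size, n_step) policy entropy at each step
mask = np.ones((batch_size, n_step))           # done/`pos` mask, as in the plain n-step case

# (n_step,) discount weights; other return estimators (e.g. Retrace(lambda)) would fold
# their extra factors into this same vector/matrix.
weighting = gamma ** np.arange(n_step)

# soft n-step return: (batch_size, n_step) @ (n_step,) -> (batch_size,)
soft_returns = ((rewards + entr_coef * entropy) * mask) @ weighting
```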
I'd love to have Retrace
I would need to look at that again, but yes, it would be nice to have an eligibility-trace-style estimate (i.e. TD(lambda) style).
The weighting is already there with the gammas, no?
we would have the entropy for (almost) free
?
For me the main issue is how to compute action log proba in a clean way? (passing the actor to the sample method, as we do when we normalize the rewards, seems the cleanest for now)
Btw, n-step does not help for now on PyBullet envs (I need to think about why; maybe some tuning is needed) but it improves results and stability on envs like LunarLander or BipedalWalkerHardcore (I could finally get the maximum performance with SAC).
The weighting is already there with the gammas, no?
In Retrace(lambda), TreeBackup(lambda), Q(lambda), etc., the weighting also uses some other factor besides gamma, so I put it in a single term.
we would have the entropy for (almost) free
?
To compute Retrace, TreeBackup and Importance Sampling, we need the action probabilities. If we have the action probabilities, we also have the entropy (for free), thus if we implement any of the above, we will also have a clear path towards implementing n-steps with entropy.
For me the main issue is how to compute action log proba in a clean way? (passing the actor to the sample method, as we do when we normalize the rewards, seems the cleanest for now)
I was also thinking of passing a function instead of the actor directly but the two are more or less the same.
In Retrace(lambda), TreeBackup(lambda), Q(lambda), etc., the weighting also uses some other factor besides gamma, so I put it in a single term.
Yes, I meant that gamma already plays the weighting role.
I was also thinking of passing a function instead of the actor
sounds better, yes ;)
EDIT: if you don't pass the actor, you still need to know the device (see issue #111 )
Yeah, makes sense. I assume the function will be of the form `np.ndarray -> np.ndarray`, so as to make it agnostic of the device.
Yeah, makes sense. I assume the function will be of the form `np.ndarray -> np.ndarray`, so as to make it agnostic of the device.
Good idea
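For illustration, a rough sketch of that contract (the helper name and `evaluate_log_prob` are placeholders, not SB3 API):

```python
import numpy as np
import torch as th


def make_log_prob_fn(actor, device):
    """Hypothetical helper: wrap the actor into an np.ndarray -> np.ndarray callable."""
    def log_prob_fn(observations: np.ndarray, actions: np.ndarray) -> np.ndarray:
        with th.no_grad():
            obs = th.as_tensor(observations, device=device)
            acts = th.as_tensor(actions, device=device)
            # `evaluate_log_prob` is a placeholder for however the actor exposes log pi(a|s)
            log_prob = actor.evaluate_log_prob(obs, acts)
        return log_prob.cpu().numpy()
    return log_prob_fn
```

The buffer would then only call `log_prob_fn(obs_batch, act_batch)` inside `_get_samples`, without ever touching torch tensors or knowing about the device.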
So how should we continue from here? Since n_steps was planned for v1.1, I can take the time to add extensions and draft a more generic version of this that also includes entropy.
Yes, sounds good ;) I would wait for support of all algorithms anyway to avoid inconsistencies. But if it is ready and fully tested, we can merge it with master.
I don't have anything to add to NStepReplay, or any other tests. So feel free to merge this. In the meantime, I can start drafting the next step (hah, pun not intended).
I don't have anything to add to NStepReplay, or any other tests.
So the entropy is not part of it? Maybe add an example in the doc.
I’d rather have a separate implementation that includes entropy but it’s up to you.
Yes I will add an example in the docs.
I don't have anything to add to NStepReplay, or any other tests.
So the entropy is not part of it? Maybe add an example in the doc.
I’d rather have a separate implementation that includes entropy but it’s up to you.
Hmm, the issue is that if we merge it now, it will only be half-working (working for DQN and TD3, but not for SAC).
Yes I will add an example in the docs.
yes, please =)
Fair point! I have a couple of weeks before grad classes pick up the pace so hopefully I will have everything tied up before then.
@araffin I am considering the following changes:
- Use a component called, say, a "batch generator" that, given the indices, produces the batch.
- Instead of subclassing the buffer class for the different buffers, just pass `buffer_kwargs`; the constructor of the buffer will then create the correct component. The component can be anything from simple n-steps to n-steps with entropy.
This removes the need to create all buffer combinations, e.g. if we were to have prioritized replay, we wouldn't need to create `n_step_prioritized_replay`.
What do you think?
Use a component called, say, a "batch generator" that, given the indices, produces the batch.
How is that different from `_get_samples()`?
Instead of subclassing the buffer class for the different buffers, just pass `buffer_kwargs`; the constructor of the buffer will then create the correct component.
Not sure what you mean; it sounds like a `_get_samples()` with a lot of branching?
How is that different from `_get_samples()`?
`_get_samples()` is bound to the class of the buffer. I would like to use composition instead of subclassing the buffer to create a new `_get_samples()`.
very rough example:

class ReplayBuffer(BaseBuffer):
    def __init__(self, ..., batch_method, **batch_kwargs):
        # `make` is a class method that creates the correct sampler depending on the args
        self._sampler = Sampler.make(batch_method, **batch_kwargs)
The consequence here is that we can do:
env = ...
agent_classes = [TD3, SAC]  # SAC will add `use_entropy=True` in `buffer_kwargs` unless specified otherwise
for cls in agent_classes:
    agent = cls(env, buffer_kwargs={'batch_method': 'nsteps', 'nsteps': 5})  # all these can be trivially logged because they are strings and ints
The pros are:
- We remove the need to provide the buffer class
- The sampler can be swapped on the fly
- We don't need to create multiple classes for buffers that just combine other classes, e.g. for n-step + PER we will not have to add `NStepPer` and `EntropyNStepPer`, and for Retrace(lambda) we will not have to add `RetracePer` and `EntropicRetracePer`.
- We can make it trivial to create new sampler classes and register them.
- SAC will not actually require a new buffer class for itself that would need to be specified in its arguments to ensure correctness.
- The number of classes is linear in the number of sampler methods.
The cons are:
- More types
- Need to write a function to construct the correct sampler (fairly easy; a rough sketch follows below)
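For concreteness, here is a rough sketch of how the registration and the `Sampler.make` construction function could look (all names are hypothetical, mirroring the rough example above):

```python
class Sampler:
    """Hypothetical base class for the "batch generator" components."""
    _registry = {}

    def __init_subclass__(cls, name=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if name is not None:
            Sampler._registry[name] = cls

    @classmethod
    def make(cls, batch_method, **batch_kwargs):
        # Look up the registered sampler class by its string name and instantiate it
        return cls._registry[batch_method](**batch_kwargs)

    def get_samples(self, buffer, batch_inds):
        raise NotImplementedError


class NStepSampler(Sampler, name="nsteps"):
    def __init__(self, nsteps=1, use_entropy=False):
        self.nsteps = nsteps
        self.use_entropy = use_entropy

    def get_samples(self, buffer, batch_inds):
        ...  # compute (soft) n-step returns from the buffer's arrays


# e.g. `Sampler.make("nsteps", nsteps=5)` inside the buffer constructor
```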