stable-baselines3-contrib
Implemented CrossQ
This PR implements CrossQ (https://openreview.net/pdf?id=PczQtTsTIX), a novel off-policy deep RL algorithm that carefully uses batch normalisation and removes target networks to achieve state-of-the-art sample efficiency at much lower computational cost, as it does not require large update-to-data ratios.
Description
This is a PyTorch implementation based on the original JAX implementation (https://github.com/adityab/CrossQ). The following plot shows that the performance matches that reported in the original paper, as well as that of the open-source SBX implementation provided by the authors (evaluated on 10 seeds).
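For intuition, here is a minimal PyTorch sketch of the core idea (not the PR code): batch-norm statistics are computed over a single joint forward pass on current and next state-action pairs, which is what allows dropping the target network. The twin-critic minimum, entropy term, and the batch renormalization layer used in practice are omitted or simplified; layer sizes and BN momentum are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BNQNetwork(nn.Module):
    """Q-network with batch norm on the input and hidden activations (illustrative sizes)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(obs_dim + act_dim, momentum=0.01),
            nn.Linear(obs_dim + act_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim, momentum=0.01),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


def crossq_critic_loss(q_net, obs, act, reward, next_obs, next_act, done, gamma=0.99):
    # Joint forward pass: the BN layers see (s, a) and (s', a') in the same
    # batch statistics, which is the trick that allows removing the target network.
    q_all = q_net(torch.cat([obs, next_obs], dim=0), torch.cat([act, next_act], dim=0))
    q, q_next = torch.chunk(q_all, 2, dim=0)
    # Bootstrapped target from the same (online) network, gradient stopped.
    target = reward + gamma * (1.0 - done) * q_next.detach()
    return F.mse_loss(q, target)
```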
Context
- [x] I have raised an issue to propose this change (required) closes #238
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [x] Documentation (update in the documentation)
Checklist:
- [x] I've read the CONTRIBUTION guide (required)
- [x] The functionality/performance matches that of the source (required for new training algorithms or training-related features).
- [x] I have updated the tests accordingly (required for a bug fix or a new feature).
- [x] I have included an example of using the feature (required for new features).
- [x] I have included baseline results (required for new training algorithms or training-related features).
- [x] I have updated the documentation accordingly.
- [ ] I have updated the changelog accordingly (required).
- [x] I have reformatted the code using `make format` (required)
- [x] I have checked the codestyle using `make check-codestyle` and `make lint` (required)
- [x] I have ensured `make pytest` and `make type` both pass. (required)
Note: we are using a maximum length of 127 characters per line
@araffin in my initial PR it seems one code style check was failing, sorry about that. I fixed it and it passes on my machine now. I hope it will go through now :)
Thanks a lot for the implementation =)
I'll try later in the week, but how is it in terms of runtime? (SAC vs CrossQ in PyTorch)
No worries :)
I just pushed most of the things you requested. I'll add some more specific responses directly to the questions above.
> how is it in terms of runtime? (SAC vs CrossQ in PyTorch)
It seems to be quite a bit slower than the SAC baseline (and the JAX implementation as well). For 4M steps, SAC on HumanoidStandup took around 12 hours whereas CrossQ took 22 hours. Not sure if there are some PyTorch implementation details that could help with speed.
I'm suspecting something is wrong with the current implementation (I'm currently investigating if it is my changes or not). My setting:
```yaml
BipedalWalker-v3:
  n_timesteps: !!float 2e5
  policy: 'MlpPolicy'
  buffer_size: 300000
  gamma: 0.98
  learning_starts: 10000
  policy_kwargs: "dict(net_arch=dict(pi=[256, 256], qf=[1024, 1024]))"
```
With the RL Zoo CLI for both SBX and SB3 (see the SBX readme for how to add support):
```bash
python train.py --algo crossq --env BipedalWalker-v3 -P --verbose 0 -param n_envs:30 gradient_steps:30 -n 200000
```
I'm getting much better results with SBX... I hope it is not the Adam parameters.
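For reference, a hypothetical sketch of the roughly equivalent direct call with the hyperparameters above, assuming CrossQ is exposed from sb3_contrib and follows the standard SB3 constructor (the keyword names mirror SAC and may differ slightly in the final PR):

```python
from sb3_contrib import CrossQ  # assumed export once this PR is merged

model = CrossQ(
    "MlpPolicy",
    "BipedalWalker-v3",
    buffer_size=300_000,
    gamma=0.98,
    learning_starts=10_000,
    policy_kwargs=dict(net_arch=dict(pi=[256, 256], qf=[1024, 1024])),
    verbose=1,
)
model.learn(total_timesteps=200_000)
```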
Did you figure out what the issue is? I was at ICRA until last week, so I didn't have time, but if you haven't found it yet I can also have a look.
Before I pushed my last commit I benchmarked it, and the results looked as expected.
> Did you figure out what the issue is? I was at ICRA until last week, so I didn't have time, but if you haven't found it yet I can also have a look.
Not yet, I was on holidays...
> Before I pushed my last commit I benchmarked it, and the results looked as expected.
I mostly observed the discrepancy on the provided env BipedalWalker-v3, and it seems to have been there before my changes.
For the other envs, I didn't have time yet to launch a full benchmark.
One difference currently is the optimizer implementation/arguments; I hope that is what's responsible for it.
Never mind, I did some more systematic tests and couldn't see any significant difference; the implementation looks good =)
Report: https://wandb.ai/openrlbenchmark/sb3-contrib/reports/SB3-Contrib-CrossQ--Vmlldzo4NTE2MTEx
Awesome, let me know if you need anything else :)
> let me know if you need anything else :)
Sure, I need to find some time to go over it and maybe polish things here and there. I will probably postpone integrating the custom layer into SB3 itself until later.
I simplified the network creation (this needs https://github.com/DLR-RM/stable-baselines3/pull/1975 to be merged into master), added the updated betas for Adam (it had an impact on my small experiments with Pendulum), and fixed a wrong default value for BN momentum that I had introduced.
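As a minimal sketch of what this looks like from the user side, assuming CrossQ's policy follows the usual SB3 `optimizer_kwargs` convention (the beta values below are illustrative, not necessarily the defaults picked here):

```python
from sb3_contrib import CrossQ  # assumed export once this PR is merged

model = CrossQ(
    "MlpPolicy",
    "Pendulum-v1",
    # Non-default Adam betas passed through the standard SB3 policy_kwargs
    # mechanism; (0.5, 0.999) is an example value, not necessarily the PR default.
    policy_kwargs=dict(optimizer_kwargs=dict(betas=(0.5, 0.999))),
    verbose=1,
)
model.learn(total_timesteps=20_000)
```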