stable-baselines3-contrib
gSDE noise sampling with TQC can raise ValueError due to nan in `log_std`
Description
In some rare cases (encountered once so far), noise sampling in gSDE can break. It happened with TQC on HalfCheetahBulletEnv-v0 after 800k timesteps; for some reason, the entropy loss diverged. This might be related to https://github.com/DLR-RM/rl-baselines3-zoo/issues/322
Run detailed here: https://wandb.ai/openrlbenchmark/sb3/runs/27cez5ua
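For context, the error comes from PyTorch's argument validation: `Normal` requires a strictly positive scale, and a single `nan` entry (here produced by exponentiating `log_std`) violates the `GreaterThan(lower_bound=0.0)` constraint. A minimal sketch of the failure mode (not the actual training code; the values are made up):

```python
import torch as th
from torch.distributions import Normal

# std as gSDE computes it: exp(log_std). A single nan entry in log_std
# propagates into the scale tensor and fails Normal's validation.
log_std = th.tensor([[float("nan"), -4.9], [-5.6, -4.6]], requires_grad=True)
std = th.exp(log_std)

# Raises: ValueError: Expected parameter scale ... to satisfy the constraint
# GreaterThan(lower_bound=0.0), but found invalid values: ...
Normal(th.zeros_like(std), std)
```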
To reproduce:
```
python -m rl_zoo3.train --algo tqc --env Ant-v3 --eval-episodes 20 --n-eval-envs 5 --seed 2609763199
```
```
Traceback (most recent call last):
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/python-3.9.12/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/python-3.9.12/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfsdswork/projects/rech/uli/upf82sp/rl-baselines3-zoo/rl_zoo3/train.py", line 283, in <module>
    train()
  File "/gpfsdswork/projects/rech/uli/upf82sp/rl-baselines3-zoo/rl_zoo3/train.py", line 276, in train
    exp_manager.learn(model)
  File "/gpfsdswork/projects/rech/uli/upf82sp/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 235, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3-contrib/sb3_contrib/tqc/tqc.py", line 296, in learn
    return super().learn(
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3/stable_baselines3/common/off_policy_algorithm.py", line 353, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3-contrib/sb3_contrib/tqc/tqc.py", line 205, in train
    self.actor.reset_noise()
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3-contrib/sb3_contrib/tqc/policies.py", line 142, in reset_noise
    self.action_dist.sample_weights(self.log_std, batch_size=batch_size)
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3/stable_baselines3/common/distributions.py", line 504, in sample_weights
    self.weights_dist = Normal(th.zeros_like(std), std)
  File "/gpfsdswork/projects/rech/uli/upf82sp/env_benchmark/lib/python3.9/site-packages/torch/distributions/normal.py", line 56, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/gpfsdswork/projects/rech/uli/upf82sp/env_benchmark/lib/python3.9/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (300, 6)) of distribution Normal(loc: torch.Size([300, 6]), scale: torch.Size([300, 6])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([[   nan, 0.0074, 0.0030, 0.0102, 0.0134, 0.0056],
        [   nan, 0.0036, 0.0056, 0.0066, 0.0092, 0.0084],
        [   nan, 0.0025, 0.0013, 0.0026, 0.0016, 0.0014],
        ...,
        [   nan, 0.0027, 0.0031, 0.0028, 0.0023, 0.0030],
        [   nan, 0.0073, 0.0029, 0.0083, 0.0040, 0.0053],
        [   nan, 0.0036, 0.0014, 0.0052, 0.0019, 0.0019]],
       grad_fn=<ExpBackward0>)
```
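The `grad_fn=<ExpBackward0>` on the scale tensor indicates it was obtained as `exp(log_std)`, so any non-finite entry of `log_std` flows straight into the `Normal` constructor on the next `reset_noise()` call. Below is a simplified sketch (adapted from `StateDependentNoiseDistribution.get_std` in `stable_baselines3/common/distributions.py`; the exact code may differ between versions) of the two ways gSDE can turn `log_std` into a positive std, including the `expln` path mentioned in the comment further down:

```python
import torch as th

def get_std(log_std: th.Tensor, use_expln: bool = False, epsilon: float = 1e-6) -> th.Tensor:
    """Simplified sketch of how gSDE maps log_std to a positive std."""
    if not use_expln:
        # Default path: plain exponential (the ExpBackward0 seen above).
        # If log_std diverges or becomes nan, so does the std.
        return th.exp(log_std)
    # expln path: exponential below zero, log1p above zero, so the std
    # grows only logarithmically for large positive log_std.
    below_threshold = th.exp(log_std) * (log_std <= 0)
    safe_log_std = log_std * (log_std > 0) + epsilon
    above_threshold = (th.log1p(safe_log_std) + 1.0) * (log_std > 0)
    return below_threshold + above_threshold
```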
System Info
- How the library was installed (pip, docker, source, ...): not specified
- Stable-Baselines3: 1.8.0a3
- sb3-contrib: 1.7.0
- GPU models and configuration: no GPU
- Python version: 3.9.12
- PyTorch version: 1.13
- Gym version: 0.21.0
I think we need to activate `use_expln=True`, which should prevent the log std from exploding, or use AdamW, yes.
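For anyone hitting this outside of rl_zoo3, here is a sketch of what that could look like with TQC directly; `use_expln` and `optimizer_class` are forwarded to the policy through `policy_kwargs` (the environment and hyperparameters below are placeholders):

```python
import torch as th
from sb3_contrib import TQC

# Sketch: enable the expln transform for gSDE (keeps the std bounded for large
# log_std) and/or switch the optimizer to AdamW, as suggested above.
model = TQC(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder env
    use_sde=True,
    policy_kwargs=dict(
        use_expln=True,
        optimizer_class=th.optim.AdamW,
    ),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```

In the zoo, the equivalent change would be adding `use_expln=True` to the `policy_kwargs` entry of the TQC hyperparameters file.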