ml-agents
SAC trainer loses all progress after --resume
Describe the bug
Every time I resume training with the SAC trainer using the --resume flag, the agent initially performs about as well as it did at the end of the previous session, but after roughly buffer_size steps the model essentially forgets everything and the reward plunges (see screenshots).
To Reproduce
Steps to reproduce the behavior:
- Train an agent with SAC until the reward stabilizes
- Stop training
- Resume training with the --resume flag (see the example command below)
- Wait until buffer_size steps have passed
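For reference, resuming here just means re-running mlagents-learn against the same results folder with the --resume flag; a hypothetical invocation, assuming the config file is named harvester.yaml (the run-id matches the results path in the log below):

mlagents-learn harvester.yaml --run-id=harvester --resume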
Console logs / stack traces
Version information:
ml-agents: 0.28.0,
ml-agents-envs: 0.28.0,
Communicator API: 1.5.0,
PyTorch: 1.7.1+cu110
[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
[INFO] Connected to Unity environment with package version 2.0.1 and communication version 1.5.0
[INFO] Connected new brain: RTS_AI?team=0
[INFO] Hyperparameters for behavior name RTS_AI:
trainer_type: sac
hyperparameters:
  learning_rate: 0.0003
  learning_rate_schedule: constant
  batch_size: 64
  buffer_size: 25000
  buffer_init_steps: 25000
  tau: 0.005
  steps_per_update: 25.0
  save_replay_buffer: False
  init_entcoef: 0.5
  reward_signal_steps_per_update: 25.0
network_settings:
  normalize: False
  hidden_units: 128
  num_layers: 1
  vis_encode_type: simple
  memory: None
  goal_conditioning_type: hyper
  deterministic: False
reward_signals:
  extrinsic:
    gamma: 0.995
    strength: 1.0
    network_settings:
      normalize: False
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
      memory: None
      goal_conditioning_type: hyper
      deterministic: False
  curiosity:
    gamma: 0.8
    strength: 0.1
    network_settings:
      normalize: False
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
      memory: None
      goal_conditioning_type: hyper
      deterministic: False
    learning_rate: 0.0003
    encoding_size: None
init_path: None
keep_checkpoints: 5
checkpoint_interval: 500000
max_steps: 5000000
time_horizon: 5
summary_freq: 1000
threaded: False
self_play: None
behavioral_cloning: None
[INFO] Resuming from results\harvester\RTS_AI.
[INFO] Resuming training from step 2499996.
[INFO] Parameter 'difficulty' is in lesson 'First' and has value 'Float: value=1'.
[INFO] RTS_AI. Step: 2500000. Time Elapsed: 18.829 s. Mean Reward: 0.820. Std of Reward: 0.000. Training.
[INFO] Exported results\harvester\RTS_AI\RTS_AI-2499999.onnx
[INFO] RTS_AI. Step: 2501000. Time Elapsed: 32.194 s. Mean Reward: 0.240. Std of Reward: 1.054. Training.
[INFO] RTS_AI. Step: 2502000. Time Elapsed: 44.475 s. Mean Reward: 0.180. Std of Reward: 1.108. Training.
[INFO] RTS_AI. Step: 2503000. Time Elapsed: 56.572 s. Mean Reward: 0.282. Std of Reward: 1.092. Training.
[INFO] RTS_AI. Step: 2504000. Time Elapsed: 68.882 s. Mean Reward: 0.330. Std of Reward: 1.047. Training.
----- goes like this for a while, and then:
c:\programs\virtualenv\python-envs\ml-agents\lib\site-packages\mlagents\trainers\torch\utils.py:309: UserWarning: This overload of nonzero is deprecated:
nonzero()
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:882.)
res += [data[(partitions == i).nonzero().squeeze(1)]]
[INFO] RTS_AI. Step: 2525000. Time Elapsed: 327.188 s. Mean Reward: 0.256. Std of Reward: 1.075. Training.
[INFO] RTS_AI. Step: 2526000. Time Elapsed: 343.136 s. Mean Reward: -1.011. Std of Reward: 1.131. Training.
[INFO] RTS_AI. Step: 2527000. Time Elapsed: 358.392 s. Mean Reward: -1.507. Std of Reward: 0.241. Training.
[INFO] RTS_AI. Step: 2528000. Time Elapsed: 373.983 s. Mean Reward: -1.771. Std of Reward: 0.285. Training.
[INFO] RTS_AI. Step: 2529000. Time Elapsed: 390.218 s. Mean Reward: -1.557. Std of Reward: 0.868. Training.
[INFO] RTS_AI. Step: 2530000. Time Elapsed: 406.325 s. Mean Reward: -1.528. Std of Reward: 0.950. Training.
[INFO] RTS_AI. Step: 2531000. Time Elapsed: 421.682 s. Mean Reward: -1.702. Std of Reward: 0.691. Training.
Screenshots
Environment (please complete the following information):
- Unity Version: 2022.1.13f1
- OS + version: Windows 10
- ML-Agents version: 2.0.1
- Torch version: 1.7.1+cu110
- Environment: custom (not an example environment), though I suspect the environment doesn't matter much in this case.
Initially I thought this was caused by the replay buffer being mostly empty after a resume (when buffer_init_steps < buffer_size in the YAML), which would make the agent perform badly, but as you can see here the problem persists even when buffer_init_steps == buffer_size. I've seen one thread about the same problem on the Unity forums, but sadly no one responded there. Is this even a bug? Could it be related to that deprecation warning? Any ideas?
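(For reference, the setting that controls whether the replay buffer is written to disk and reloaded on --resume is save_replay_buffer, which is False in the config above; a minimal SAC config sketch enabling it could look like the following, with the other values taken from this run:)

behaviors:
  RTS_AI:
    trainer_type: sac
    hyperparameters:
      buffer_size: 25000
      buffer_init_steps: 25000
      save_replay_buffer: true   # persist the replay buffer across interruptions and --resume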
Any updates about this case yet?
I have been able to reproduce this and also found a workaround.
When a SAC training run is resumed, the initial entropy coefficient (init_entcoef) is taken from the config file, or falls back to the default value (1.0) if it is not set there. That initial value is then used to reset the policy's current entropy coefficient in the actual training run after resuming.
Since the entropy coefficient controls how strongly randomness/exploration is encouraged, resetting it back to its initial value usually causes the policy to deteriorate after resuming.
By setting init_entcoef to whatever was last logged in TensorBoard under "Policy/Continuous Entropy Coeff", I was able to resume without any issues.
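For illustration, such an override might look like the sketch below (the 0.05 is only a placeholder; use whatever value your own run last logged for "Policy/Continuous Entropy Coeff"):

behaviors:
  RTS_AI:
    trainer_type: sac
    hyperparameters:
      # placeholder value: copy the last logged "Policy/Continuous Entropy Coeff" from TensorBoard
      init_entcoef: 0.05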
My suggestion: if the config file defines the same initial entropy coefficient as the previously stored config, or does not define it at all, then the resume should probably use whatever entropy coefficient was last seen before saving.
I also noticed another thing, which I believe is probably happening in the original example above as well.
After resuming, the coefficient sometimes stops being logged. Looking at the coefficient in the screenshot, judging by the gridlines it only runs up to about 2.4 million steps, while the training log shows that training was resumed at around 2.5 million. That means the coefficient is not actually shown properly in the screenshot after the resume. In some cases it does get logged properly (that's how I got the idea for this in the first place), but it's possibly a separate bug that made this problem even less obvious.
However, we do see the curiosity and entropy values (not the coefficient) shoot way up, which by my estimation explains the deterioration in this example quite well. Those spikes would in turn be caused by the reset coefficient, which is simply not being logged correctly.