
SAC-trainer loses all progress after --resume

fquarters opened this issue 2 years ago

Describe the bug
Every time I resume training with the SAC trainer using the --resume flag, the agent initially performs about as well as it did at the end of the previous session, but after roughly buffer_size steps the model essentially forgets everything and the reward plunges (see screenshots).

To Reproduce
Steps to reproduce the behavior:

  1. Train an agent with SAC until the reward stabilizes
  2. Stop training
  3. Resume training with --resume flag
  4. Wait until buffer_size steps have passed

Console logs / stack traces

Version information:
  ml-agents: 0.28.0,
  ml-agents-envs: 0.28.0,
  Communicator API: 1.5.0,
  PyTorch: 1.7.1+cu110
[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
[INFO] Connected to Unity environment with package version 2.0.1 and communication version 1.5.0
[INFO] Connected new brain: RTS_AI?team=0
[INFO] Hyperparameters for behavior name RTS_AI:
        trainer_type:   sac
        hyperparameters:
          learning_rate:        0.0003
          learning_rate_schedule:       constant
          batch_size:   64
          buffer_size:  25000
          buffer_init_steps:    25000
          tau:  0.005
          steps_per_update:     25.0
          save_replay_buffer:   False
          init_entcoef: 0.5
          reward_signal_steps_per_update:       25.0
        network_settings:
          normalize:    False
          hidden_units: 128
          num_layers:   1
          vis_encode_type:      simple
          memory:       None
          goal_conditioning_type:       hyper
          deterministic:        False
        reward_signals:
          extrinsic:
            gamma:      0.995
            strength:   1.0
            network_settings:
              normalize:        False
              hidden_units:     128
              num_layers:       2
              vis_encode_type:  simple
              memory:   None
              goal_conditioning_type:   hyper
              deterministic:    False
          curiosity:
            gamma:      0.8
            strength:   0.1
            network_settings:
              normalize:        False
              hidden_units:     128
              num_layers:       2
              vis_encode_type:  simple
              memory:   None
              goal_conditioning_type:   hyper
              deterministic:    False
            learning_rate:      0.0003
            encoding_size:      None
        init_path:      None
        keep_checkpoints:       5
        checkpoint_interval:    500000
        max_steps:      5000000
        time_horizon:   5
        summary_freq:   1000
        threaded:       False
        self_play:      None
        behavioral_cloning:     None
[INFO] Resuming from results\harvester\RTS_AI.
[INFO] Resuming training from step 2499996.
[INFO] Parameter 'difficulty' is in lesson 'First' and has value 'Float: value=1'.
[INFO] RTS_AI. Step: 2500000. Time Elapsed: 18.829 s. Mean Reward: 0.820. Std of Reward: 0.000. Training.
[INFO] Exported results\harvester\RTS_AI\RTS_AI-2499999.onnx
[INFO] RTS_AI. Step: 2501000. Time Elapsed: 32.194 s. Mean Reward: 0.240. Std of Reward: 1.054. Training.
[INFO] RTS_AI. Step: 2502000. Time Elapsed: 44.475 s. Mean Reward: 0.180. Std of Reward: 1.108. Training.
[INFO] RTS_AI. Step: 2503000. Time Elapsed: 56.572 s. Mean Reward: 0.282. Std of Reward: 1.092. Training.
[INFO] RTS_AI. Step: 2504000. Time Elapsed: 68.882 s. Mean Reward: 0.330. Std of Reward: 1.047. Training.
----- goes like this for a while, and then:
c:\programs\virtualenv\python-envs\ml-agents\lib\site-packages\mlagents\trainers\torch\utils.py:309: UserWarning: This overload of nonzero is deprecated:
        nonzero()
Consider using one of the following signatures instead:
        nonzero(*, bool as_tuple) (Triggered internally at  ..\torch\csrc\utils\python_arg_parser.cpp:882.)
  res += [data[(partitions == i).nonzero().squeeze(1)]]
[INFO] RTS_AI. Step: 2525000. Time Elapsed: 327.188 s. Mean Reward: 0.256. Std of Reward: 1.075. Training.
[INFO] RTS_AI. Step: 2526000. Time Elapsed: 343.136 s. Mean Reward: -1.011. Std of Reward: 1.131. Training.
[INFO] RTS_AI. Step: 2527000. Time Elapsed: 358.392 s. Mean Reward: -1.507. Std of Reward: 0.241. Training.
[INFO] RTS_AI. Step: 2528000. Time Elapsed: 373.983 s. Mean Reward: -1.771. Std of Reward: 0.285. Training.
[INFO] RTS_AI. Step: 2529000. Time Elapsed: 390.218 s. Mean Reward: -1.557. Std of Reward: 0.868. Training.
[INFO] RTS_AI. Step: 2530000. Time Elapsed: 406.325 s. Mean Reward: -1.528. Std of Reward: 0.950. Training.
[INFO] RTS_AI. Step: 2531000. Time Elapsed: 421.682 s. Mean Reward: -1.702. Std of Reward: 0.691. Training.

Screenshots: [TensorBoard plots attached to the original issue; the reward and entropy-coefficient curves are discussed in the comments below]

Environment (please complete the following information):

  • Unity Version: 2022.1.13f1
  • OS + version: Windows 10
  • ML-Agents version: 2.0.1
  • Torch version: 1.7.1+cu110
  • Environment: not an example environment, but I have a suspicion the environment doesn't matter much in this case.

Initially I thought this was due to the buffer being mostly empty (when buffer_init_steps < buffer_size in the YAML config), causing the agent to perform badly, but as you can see here, the problem persists even when buffer_init_steps == buffer_size. I've seen one topic about the same problem on the Unity forums, but sadly no one responded there. Is this even a bug? Could it be related to that deprecation warning? Any ideas?

fquarters · Aug 22 '22 21:08

This issue has been automatically marked as stale because it has not had activity in the last 28 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Sep 21 '22 02:09

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Jan 08 '23 04:01

Any updates about this case yet?

OmarVector · Feb 14 '23 09:02

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Jun 18 '23 11:06

I have been able to reproduce this and also found a workaround.

When a SAC training run is resumed, the initial entropy coefficient (init_entcoef) is taken from the config file, or falls back to the default value (1.0) if it is not set there. That initial value is then used to reset the policy's current entropy coefficient in the resumed training run.

Since the entropy coefficient controls how much randomness/exploration is encouraged, this reset usually causes the policy to deteriorate after resuming.
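
For illustration, here is a minimal sketch of how a SAC-style trainer typically wires up init_entcoef. The names are illustrative and not ML-Agents' exact internals, but they show why re-seeding the coefficient from the config on resume throws away what was learned:

    import torch

    # init_entcoef as read from the YAML config; this happens on every (re)start.
    init_entcoef = 0.5

    # The entropy coefficient is a learnable parameter, stored in log space and
    # seeded from init_entcoef.
    log_ent_coef = torch.nn.Parameter(
        torch.log(torch.tensor([init_entcoef], dtype=torch.float32))
    )
    ent_coef_optimizer = torch.optim.Adam([log_ent_coef], lr=3e-4)

    # During training it is pushed towards a target entropy, roughly:
    #   ent_coef_loss = -(log_ent_coef * (log_probs + target_entropy).detach()).mean()
    # so after a long run it usually ends up much smaller than the initial value.
    # If the resumed run re-creates log_ent_coef from init_entcoef instead of
    # restoring the trained value, the policy is suddenly pushed back towards
    # high-entropy (near-random) behaviour, which matches the collapse above.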

By setting init_entcoef to whatever was last logged in TensorBoard under "Policy/Continuous Entropy Coeff", I was able to resume without any issues.
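
In case it helps anyone else, this is roughly how that value can be looked up programmatically. This is a sketch, not part of ML-Agents: it assumes the tensorboard package is installed, uses the run directory from the log above, and the exact tag name should be checked against your own run's tags:

    # Read the last logged "Policy/Continuous Entropy Coeff" from the previous
    # run's TensorBoard event files, so it can be copied into init_entcoef.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    run_dir = r"results\harvester\RTS_AI"  # run directory from the log above
    acc = EventAccumulator(run_dir)
    acc.Reload()

    print(acc.Tags()["scalars"])  # verify the exact tag name for your run
    last = acc.Scalars("Policy/Continuous Entropy Coeff")[-1]
    print(f"step {last.step}: entropy coefficient = {last.value}")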

My suggestion is that if the config file defines the same initial entropy coefficient as the previously stored config, or does not define it at all, the resume should use whatever entropy coefficient was last seen before saving.
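
To make that concrete, the decision rule could look something like the following. This is purely illustrative pseudocode with hypothetical names, not ML-Agents' actual resume path:

    def resolve_entropy_coefficient(config_value, previous_config_value,
                                    checkpointed_value, default=1.0):
        """Pick the entropy coefficient to use when resuming a SAC run.

        config_value is init_entcoef from the new config (None if unset),
        previous_config_value is what the previous run used, and
        checkpointed_value is the trained coefficient saved at the checkpoint.
        """
        explicitly_changed = (
            config_value is not None and config_value != previous_config_value
        )
        if explicitly_changed:
            # The user deliberately overrode init_entcoef: respect the new value.
            return config_value
        if checkpointed_value is not None:
            # Unchanged or unset: keep the coefficient learned before saving.
            return checkpointed_value
        return config_value if config_value is not None else default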

martkartasev · Sep 20 '23 07:09

I also noticed another thing, which I believe is probably happening in the original example above as well.

After resuming, the coefficient sometimes stops being logged. If we look at the coefficient in the screenshot, judging by the gridlines it is only plotted up to about 2.4 million steps or so.

The training log shows that training was resumed at around 2.5 million, which means the coefficient is not actually shown properly in the screenshot after the resume. In some cases it does get logged properly, which is how I got the idea for this in the first place, but it is possibly a separate bug that made this problem even less obvious.

However, we do see the curiosity and entropy curves (not the coefficient) shoot way up, which in my estimation explains the deterioration in this example quite well. Those in turn would be caused by the reset coefficient, which is just not being logged correctly.

martkartasev · Sep 20 '23 07:09