[BUG]: LAMB learns in the opposite direction with a large learning rate

Open · xiezhipeng-git opened this issue 2 years ago · 8 comments

🐛 Describe the bug

Bug: with a large learning rate, LAMB learns in the opposite direction and the reward collapses to NaN.

Reproduce:

  1. git clone https://github.com/Lizhi-sjtu/MARL-code-pytorch
  2. In MADDPG_MATD3_main.py, use the simple_spread environment with discrete=False and raise the learning rates:
     parser.add_argument("--lr_a", type=float, default=1e-2, help="Learning rate of actor 5e-4")
     parser.add_argument("--lr_c", type=float, default=1e-2, help="Learning rate of critic 5e-4")
  3. In matd3.py, switch both optimizers to LAMB:
     self.actor_optimizer = Lamb(self.actor.parameters(), lr=self.lr_a)
     self.critic_optimizer = Lamb(self.critic.parameters(), lr=self.lr_c)

Result (runName: Lamb_elu_l1e-2c1noise0.3wave100QNClip0.5; 用时 translated as elapsed time in hours):

  total_steps:1000 evaluate_reward:-239.7472497527612 noise_std:0.2991666666666612 elapsed:0.02381520940197839 h
  total_steps:2000 evaluate_reward:-219.07316555897853 noise_std:0.2983333333333224 elapsed:0.04523143476910061 h
  total_steps:3000 evaluate_reward:-206.3208274189844 noise_std:0.2974999999999836 elapsed:0.06718448877334594 h
  total_steps:4000 evaluate_reward:-201.41465567655362 noise_std:0.2966666666666448 elapsed:0.08901055402225919 h
  total_steps:5000 evaluate_reward:-202.76059842118968 noise_std:0.295833333333306 elapsed:0.1106544370121426 h
  total_steps:6000 evaluate_reward:-205.83318603841298 noise_std:0.29499999999996723 elapsed:0.13226939015918307 h
  total_steps:7000 evaluate_reward:-339.8071060633132 noise_std:0.29416666666662844 elapsed:0.15377979503737554 h
  total_steps:8000 evaluate_reward:-639.448898103806 noise_std:0.29333333333328965 elapsed:0.17607069863213434 h
  total_steps:9000 evaluate_reward:-647.7273337046531 noise_std:0.29249999999995085 elapsed:0.19738913575808206 h
  total_steps:10000 evaluate_reward:-648.6930631087723 noise_std:0.29166666666661206 elapsed:0.2186523597770267 h
  E:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\mpe\multiagent\core.py:192: RuntimeWarning: invalid value encountered in logaddexp
    penetration = np.logaddexp(0, -(dist - dist_min)/k)*k
  total_steps:11000 evaluate_reward:nan noise_std:0.29083333333327327 elapsed:0.2401244193315506 h

All the other optimizers in torch.optim work fine with the same setup.

This makes me lose trust in ColossalAI.
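For reference, here is a minimal standalone check of the step direction, kept separate from the MATD3 code (a sketch: the toy quadratic objective is hypothetical, and the import path for ColossalAI's Lamb is an assumption that may need adjusting):

```python
import torch
# Assumption: this is where ColossalAI exposes its LAMB implementation;
# adjust the import path to match the installed version.
from colossalai.nn.optimizer import Lamb

# Toy objective: loss = 0.5 * w^2, minimised at w = 0, so a correct
# descent step must move w toward zero.
w = torch.nn.Parameter(torch.tensor([10.0]))
optimizer = Lamb([w], lr=1e-2)

loss = 0.5 * (w ** 2).sum()
loss.backward()
before = w.item()
optimizer.step()
after = w.item()

print(f"w before step: {before:.4f}, after step: {after:.4f}")
print("moved toward the minimum:", abs(after) < abs(before))
```

If the check prints False, the update really does move against the gradient; if it prints True, the collapse in the MATD3 run is more likely a step-size issue than a sign issue.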

Environment

CUDA release 11.7, V11.7.64 / 8.6.0 / I don't know / Python 3.9.13 / PyTorch 1.12.1+cu116

xiezhipeng-git · Dec 27 '22 05:12

Hi, we also noticed some abnormal points in your results. Could you check them one by one?

  1. The reward is always negative. Is this expected?
  2. Did you try any other learning rate? The default one (1e-2) seems a bit large for the LAMB optimizer; maybe this is why the model cannot converge.

kurisusnowdeng · Dec 27 '22 13:12

> Hi, we also noticed some abnormal points in your results. Could you check them one by one?
>
>   1. The reward is always negative. Is this expected?
>   2. Did you try any other learning rate? The default one (1e-2) seems a bit large for the LAMB optimizer; maybe this is why the model cannot converge.
  1. Yes, it is normal. MPE is a common benchmark environment for multi-agent reinforcement learning. With AdamW at lr=0.01 you can reach about -125 reward within 250k steps.
  2. I tried lr=0.001 with LAMB. The result was worse, so I stopped the run, because AdamW does better: at lr=0.001 AdamW also reaches about -125 in 250k steps, while the LAMB run gives total_steps:304000 evaluate_reward:-140.0160343729058 noise_std:0.05 runName:Lamb_elu_l1e-3c1noise0.3wave100QNClip0.5 elapsed:11.221000528732935 h. Even if the learning rate were simply too high, the behaviour still points to a problem in the optimizer: training is normal at the beginning, and none of the optimizers shipped with PyTorch behave like this. (A small learning-rate sweep is sketched below.)
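As a side note, a quick learning-rate sweep on a toy regression problem can help separate "the rate is simply too high" from "the update itself is wrong" (a sketch; the toy model is hypothetical and the colossalai import path is an assumption):

```python
import torch
from torch import nn
# Assumption: ColossalAI's LAMB is importable here; swap in any LAMB build under test.
from colossalai.nn.optimizer import Lamb

def final_loss(opt_cls, lr, steps=300):
    """Train a tiny linear regressor and return the final MSE."""
    torch.manual_seed(0)
    model = nn.Linear(16, 1)
    x = torch.randn(512, 16)
    y = x @ torch.randn(16, 1)  # synthetic linear target
    opt = opt_cls(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

for lr in (1e-3, 3e-3, 1e-2):
    print(f"lr={lr:.0e}  AdamW loss={final_loss(torch.optim.AdamW, lr):.4f}  "
          f"Lamb loss={final_loss(Lamb, lr):.4f}")
```

If LAMB's loss grows at every rate while AdamW's shrinks, the implementation itself is suspect rather than the hyperparameter choice.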

xiezhipeng-git · Dec 27 '22 13:12

@xiezhipeng-git Thank you for the wonderful experiments. It looks like the LAMB optimizer can work with lr=0.001 now, since the reward reaches about -140 (if I understand you correctly). But there still seems to be a problem you want to point out? Could you please indicate where the optimizer behaves abnormally? Appreciate it very much.

kurisusnowdeng · Dec 28 '22 00:12

> @xiezhipeng-git Thank you for the wonderful experiments. It looks like the LAMB optimizer can work with lr=0.001 now, since the reward reaches about -140 (if I understand you correctly). But there still seems to be a problem you want to point out? Could you please indicate where the optimizer behaves abnormally? Appreciate it very much.

I cannot. I think the lr=0.01 result already indicates abnormal behaviour, so I am reporting it to see whether we can fix it.

xiezhipeng-git · Dec 29 '22 14:12

@xiezhipeng-git I think that is just overfitting because 0.01 is a bit large (similar to this figure). Optimization algorithms may differ in their best learning rates. BTW, could you also try another LAMB implementation, such as https://nvidia.github.io/apex/optimizers.html#apex.optimizers.FusedLAMB, to see if that works?

kurisusnowdeng · Jan 03 '23 07:01

> @xiezhipeng-git I think that is just overfitting because 0.01 is a bit large (similar to this figure). Optimization algorithms may differ in their best learning rates. BTW, could you also try another LAMB implementation, such as https://nvidia.github.io/apex/optimizers.html#apex.optimizers.FusedLAMB, to see if that works?

I cannot try it, because installing apex fails:

  1 error detected in the compilation of "csrc/multi_tensor_axpby_kernel.cu".
  error: command 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe' failed with exit code 4294967295
  error: subprocess-exited-with-error

× Running setup.py install for apex did not run successfully. │ exit code: 1 ╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip. full command: '*/python.exe' -u -c '

Can you try it and tell me the result, or tell me how to use FusedLAMB?
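For reference, once apex is built with its CUDA extensions, FusedLAMB follows the usual torch.optim constructor, so in matd3.py it would be a drop-in replacement for the Lamb(...) calls above. A minimal sketch (the toy linear model here is hypothetical):

```python
import torch
from torch import nn
# Requires a successful apex build with CUDA extensions; FusedLAMB only runs on CUDA tensors.
from apex.optimizers import FusedLAMB

model = nn.Linear(8, 1).cuda()
optimizer = FusedLAMB(model.parameters(), lr=1e-3)

# One dummy optimisation step to confirm the optimizer is usable.
x = torch.randn(32, 8, device="cuda")
loss = model(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```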

xiezhipeng-git · Jan 03 '23 17:01

I think there should be many other equivalent implementations, perhaps in fairseq for example. You could look for one and try it. Good luck.

kurisusnowdeng · Jan 04 '23 08:01

Using fairseq also requires apex to be installed:

  ModuleNotFoundError: No module named 'apex'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "e:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\4.MADDPG_MATD3_MPE\MADDPG_MATD3_main.py", line 293, in <module>
      runner = Runner(args, env_name=env_names[env_index], runName="fairseq_lamb_elu_l1e-2c1noise0.3wave100QNClip0.5", seed=0)
    File "e:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\4.MADDPG_MATD3_MPE\MADDPG_MATD3_main.py", line 52, in __init__
      self.agent_n = [MATD3(args, agent_id) for agent_id in range(args.N)]
    File "e:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\4.MADDPG_MATD3_MPE\MADDPG_MATD3_main.py", line 52, in <listcomp>
      self.agent_n = [MATD3(args, agent_id) for agent_id in range(args.N)]
    File "e:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\4.MADDPG_MATD3_MPE\matd3.py", line 65, in __init__
      optimizer_a = build_optimizer(self.namespace_dls_a, params_a)
    File "E:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\fairseq\optim\__init__.py", line 41, in build_optimizer
      return _build_optimizer(cfg, params, *extra_args, **extra_kwargs)
    File "E:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\fairseq\registry.py", line 65, in build_x
      return builder(cfg, *extra_args, **extra_kwargs)
    File "E:\study\machineStudy\project\My_matd3\my_matd3\MARL-code-pytorch\fairseq\optim\fused_lamb.py", line 20, in __init__
      raise ImportError("Please install apex to use LAMB optimizer")
  ImportError: Please install apex to use LAMB optimizer

xiezhipeng-git · Jan 04 '23 16:01

We have updated a lot. This issue was closed due to inactivity. Thanks. Here are some verified LAMB parameters. https://github.com/NUS-HPC-AI-Lab/pytorch-lamb
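For anyone landing here later, a minimal usage sketch, assuming the linked fork keeps the upstream pytorch_lamb package interface (a Lamb class with the usual torch.optim constructor); the hyperparameters below are placeholders rather than the verified values:

```python
import torch
from torch import nn
# Assumption: the linked fork exposes the same interface as the upstream
# pytorch_lamb package; install it from the repo linked above.
from pytorch_lamb import Lamb

model = nn.Linear(8, 1)
# Placeholder hyperparameters; take the verified values from the linked repo's README.
optimizer = Lamb(model.parameters(), lr=1e-3, weight_decay=0.01, betas=(0.9, 0.999))

# One dummy optimisation step to confirm the optimizer is usable.
x = torch.randn(32, 8)
loss = model(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```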

binmakeswell · Apr 14 '23 09:04