
[BUG]: fp32 param and grad have different shape torch.Size([5064704]) vs torch.Size([128000]) when use lora_rank=4 at stage 1

Open boundles opened this issue 1 year ago • 6 comments

🐛 Describe the bug

Two worker processes hit the same assertion:

Traceback (most recent call last):
  File "train_sft.py", line 175, in <module>
    train(args)
  File "train_sft.py", line 146, in train
    trainer.fit(logger=logger, log_interval=args.log_interval)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/sft.py", line 102, in fit
    self.strategy.optimizer_step(self.optimizer)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
    optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/luban/.local/lib/python3.8/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
    assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([5073920]) vs torch.Size([16384])

Traceback (most recent call last):
  File "train_sft.py", line 175, in <module>
    train(args)
  File "train_sft.py", line 146, in train
    trainer.fit(logger=logger, log_interval=args.log_interval)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/sft.py", line 102, in fit
    self.strategy.optimizer_step(self.optimizer)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
    optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/luban/.local/lib/python3.8/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
    assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([5064704]) vs torch.Size([128000])
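For context, the assertion at low_level_optim.py line 467 compares the flat fp32 master copy of this rank's parameter partition with the flattened fp16 gradients gathered for that partition. Below is a small, self-contained sketch of one way the two sizes can diverge when most parameters are frozen (which is what a LoRA setup does to the base weights); it is illustrative only and not ColossalAI's code:

```python
import torch
import torch.nn as nn

# Toy setup: a large frozen base layer plus a tiny trainable LoRA-style adapter.
base = nn.Linear(1024, 1024)
for p in base.parameters():
    p.requires_grad_(False)                      # frozen, as LoRA freezes base weights
adapter = nn.Linear(1024, 4, bias=False)         # small trainable adapter

params = list(base.parameters()) + list(adapter.parameters())

# Flat fp32 "master" copy built from every parameter in the group.
flat_fp32_params = torch.cat([p.detach().float().flatten() for p in params])

# One forward/backward pass: only trainable parameters receive gradients.
out = adapter(base(torch.randn(2, 1024)))
out.sum().backward()

# Flat gradient built only from parameters that actually got a .grad.
flat_grads = torch.cat([p.grad.flatten() for p in params if p.grad is not None])

print(flat_fp32_params.shape, flat_grads.shape)
# torch.Size([1053696]) torch.Size([4096]) -> the same kind of mismatch the assertion
# reports when frozen parameters are skipped on the gradient side but not on the
# fp32 master-parameter side.
```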

Environment

No response

boundles · Apr 03 '23 07:04

I am having the same issue here

alibabadoufu · Apr 03 '23 13:04

I have a similar issue too.

│   153 │   def optimizer_step(self, optimizer: optim.Optimizer, **kwargs) -> None:                │
│ ❱ 154 │   │   optimizer.step()                                                                   │
│   155 │                                                                                          │
│   156 │   @staticmethod                                                                          │
│   157 │   def _unwrap_actor(actor: Actor) -> nn.Module:                                          │
│                                                                                                  │
│ /opt/conda/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:65 in wrapper                 │
│                                                                                                  │
│     62 │   │   │   │   instance = instance_ref()                                                 │
│     63 │   │   │   │   instance._step_count += 1                                                 │
│     64 │   │   │   │   wrapped = func.__get__(instance, cls)                                     │
│ ❱   65 │   │   │   │   return wrapped(*args, **kwargs)                                           │
│     66 │   │   │                                                                                 │
│     67 │   │   │   # Note that the returned function here is no longer a bound method,           │
│     68 │   │   │   # so attributes like `__func__` and `__self__` no longer exist.               │
│                                                                                                  │
│ /opt/conda/lib/python3.9/site-packages/colossalai/zero/sharded_optim/low_level_optim.py:467 in   │
│ step                                                                                             │
│                                                                                                  │
│   464 │   │   │   flat_fp32_avg_grads = flat_fp16_avg_grads.to(dtype)                            │
│   465 │   │   │                                                                                  │
│   466 │   │   │   param_shape = self._fp32_flat_param_groups_of_current_rank[group_id].shape     │
│ ❱ 467 │   │   │   assert param_shape == flat_fp32_avg_grads.shape, \                             │
│   468 │   │   │   │   f'fp32 param and grad have different shape {param_shape} vs {flat_fp32_a   │
│   469 │   │   │                                                                                  │
│   470 │   │   │   single_grad_partition_groups.append(flat_fp32_avg_grads)                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: fp32 param and grad have different shape torch.Size([10138624]) vs torch.Size([144384])

zhangliang-04 · Apr 03 '23 16:04

I am having the same issue here

qinqinqaq · Apr 04 '23 02:04

I am having the same issue here

GongCQ · Apr 11 '23 10:04

The same issue too. How did you solve it?

elven2016 · Apr 20 '23 03:04

the same issue too

wangxiaobo007 · Apr 20 '23 10:04

the same issue too

qianyuqiu79 · May 05 '23 08:05

This error occurs when I use --lora_rank and --grad_checkpoint together. Use either --lora_rank or --grad_checkpoint, not both.
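If you want the incompatibility to fail fast instead of crashing inside the optimizer step, a guard like the following can go in your own launch script. This is only a sketch: the flag names are the ones mentioned in this issue, and the check is not part of Coati or ColossalAI.

```python
import argparse

# Hypothetical guard for your own training script; not part of Coati/ColossalAI.
parser = argparse.ArgumentParser()
parser.add_argument("--lora_rank", type=int, default=0)
parser.add_argument("--grad_checkpoint", action="store_true")
args, _ = parser.parse_known_args()

if args.lora_rank > 0 and args.grad_checkpoint:
    # Fail early with a clear message instead of hitting the shape assertion at step time.
    parser.error("--lora_rank and --grad_checkpoint cannot be combined here; enable only one")
```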

l241025097 · May 30 '23 03:05

the same issue too

zhangyuanscall · Jun 21 '23 04:06

the same issue too

pvop · Jul 24 '23 11:07

the same issue too

pvop · Jul 25 '23 01:07

AssertionError: fp32 param and grad have different shape. I have solved this error. I use GLM-10B to train a reward model. The 'mems' output is used as last_hidden_states, but 'mems' is detached, so it is cut out of the computation graph and gradients cannot flow back into the model. Therefore, verify your last_hidden_states and make sure it is still part of the computation graph.
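A minimal sketch of the failure mode described above and a quick check for it. The names mems and last_hidden_states follow the comment; this is not the GLM-10B code itself:

```python
import torch

x = torch.randn(2, 8, requires_grad=True)
hidden = x * 2.0              # stands in for the model's hidden states

detached = hidden.detach()    # detach() cuts the tensor out of the autograd graph,
                              # which is what happens to GLM's 'mems' per the comment above
print(detached.grad_fn)       # None -> no gradient can flow back through it
print(hidden.grad_fn)         # a grad_fn -> still connected to the graph

# Sanity check before training: the tensor fed into the reward head must still be
# part of the computation graph, otherwise the backbone never receives gradients.
last_hidden_states = hidden   # use the non-detached tensor
assert last_hidden_states.requires_grad and last_hidden_states.grad_fn is not None
```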

pvop · Jul 25 '23 04:07