
[BUG]: fp32 param and grad have different shape torch.Size([5064704]) vs torch.Size([128000]) when use lora_rank=4 at stage 1

Open boundles opened this issue 1 year ago • 6 comments

🐛 Describe the bug

Two worker processes hit the same assertion:

Traceback (most recent call last):
  File "train_sft.py", line 175, in <module>
    train(args)
  File "train_sft.py", line 146, in train
    trainer.fit(logger=logger, log_interval=args.log_interval)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/sft.py", line 102, in fit
    self.strategy.optimizer_step(self.optimizer)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
    optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/luban/.local/lib/python3.8/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
    assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([5073920]) vs torch.Size([16384])

Traceback (most recent call last):
  File "train_sft.py", line 175, in <module>
    train(args)
  File "train_sft.py", line 146, in train
    trainer.fit(logger=logger, log_interval=args.log_interval)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/sft.py", line 102, in fit
    self.strategy.optimizer_step(self.optimizer)
  File "/home/luban/.local/lib/python3.8/site-packages/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
    optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/luban/.local/lib/python3.8/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
    assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([5064704]) vs torch.Size([128000])
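For context, the assertion at low_level_optim.py line 467 compares the flat fp32 master copy of this rank's parameter partition with the flattened fp16 gradients gathered for that partition. Below is a small, self-contained sketch of one way the two sizes can diverge when most parameters are frozen (which is what a LoRA setup does to the base weights); it is illustrative only and not ColossalAI's code:

```python
import torch
import torch.nn as nn

# Toy setup: a large frozen base layer plus a tiny trainable LoRA-style adapter.
base = nn.Linear(1024, 1024)
for p in base.parameters():
    p.requires_grad_(False)                      # frozen, as LoRA freezes base weights
adapter = nn.Linear(1024, 4, bias=False)         # small trainable adapter

params = list(base.parameters()) + list(adapter.parameters())

# Flat fp32 "master" copy built from every parameter in the group.
flat_fp32_params = torch.cat([p.detach().float().flatten() for p in params])

# One forward/backward pass: only trainable parameters receive gradients.
out = adapter(base(torch.randn(2, 1024)))
out.sum().backward()

# Flat gradient built only from parameters that actually got a .grad.
flat_grads = torch.cat([p.grad.flatten() for p in params if p.grad is not None])

print(flat_fp32_params.shape, flat_grads.shape)
# torch.Size([1053696]) torch.Size([4096]) -> the same kind of mismatch the assertion
# reports when frozen parameters are skipped on the gradient side but not on the
# fp32 master-parameter side.
```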

Environment

No response

boundles · Apr 03 '23 07:04

I am having the same issue here

alibabadoufu · Apr 03 '23 13:04

I have a similar issue too.

│   153 │   def optimizer_step(self, optimizer: optim.Optimizer, **kwargs) -> None:                │
│ ❱ 154 │   │   optimizer.step()                                                                   │
│   155 │                                                                                          │
│   156 │   @staticmethod                                                                          │
│   157 │   def _unwrap_actor(actor: Actor) -> nn.Module:                                          │
│                                                                                                  │
│ /opt/conda/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:65 in wrapper                 │
│                                                                                                  │
│     62 │   │   │   │   instance = instance_ref()                                                 │
│     63 │   │   │   │   instance._step_count += 1                                                 │
│     64 │   │   │   │   wrapped = func.__get__(instance, cls)                                     │
│ ❱   65 │   │   │   │   return wrapped(*args, **kwargs)                                           │
│     66 │   │   │                                                                                 │
│     67 │   │   │   # Note that the returned function here is no longer a bound method,           │
│     68 │   │   │   # so attributes like `__func__` and `__self__` no longer exist.               │
│                                                                                                  │
│ /opt/conda/lib/python3.9/site-packages/colossalai/zero/sharded_optim/low_level_optim.py:467 in   │
│ step                                                                                             │
│                                                                                                  │
│   464 │   │   │   flat_fp32_avg_grads = flat_fp16_avg_grads.to(dtype)                            │
│   465 │   │   │                                                                                  │
│   466 │   │   │   param_shape = self._fp32_flat_param_groups_of_current_rank[group_id].shape     │
│ ❱ 467 │   │   │   assert param_shape == flat_fp32_avg_grads.shape, \                             │
│   468 │   │   │   │   f'fp32 param and grad have different shape {param_shape} vs {flat_fp32_a   │
│   469 │   │   │                                                                                  │
│   470 │   │   │   single_grad_partition_groups.append(flat_fp32_avg_grads)                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: fp32 param and grad have different shape torch.Size([10138624]) vs torch.Size([144384])

zhangliang-04 · Apr 03 '23 16:04

I am having the same issue here

qinqinqaq · Apr 04 '23 02:04

I am having the same issue here

GongCQ · Apr 11 '23 10:04

The same issue too. How did you solve it?

elven2016 · Apr 20 '23 03:04

the same issue too

wangxiaobo007 · Apr 20 '23 10:04

the same issue too

qianyuqiu79 · May 05 '23 08:05

This error occurs when I use --lora_rank and --grad_checkpoint together. Use either --lora_rank or --grad_checkpoint, not both.
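If you want the incompatibility to fail fast instead of crashing inside the optimizer step, a guard like the following can go in your own launch script. This is only a sketch: the flag names are the ones mentioned in this issue, and the check is not part of Coati or ColossalAI.

```python
import argparse

# Hypothetical guard for your own training script; not part of Coati/ColossalAI.
parser = argparse.ArgumentParser()
parser.add_argument("--lora_rank", type=int, default=0)
parser.add_argument("--grad_checkpoint", action="store_true")
args, _ = parser.parse_known_args()

if args.lora_rank > 0 and args.grad_checkpoint:
    # Fail early with a clear message instead of hitting the shape assertion at step time.
    parser.error("--lora_rank and --grad_checkpoint cannot be combined here; enable only one")
```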

l241025097 · May 30 '23 03:05

the same issue too

zhangyuanscall · Jun 21 '23 04:06

the same issue too

pvop · Jul 24 '23 11:07

the same issue too

pvop · Jul 25 '23 01:07

AssertionError: fp32 param and grad have different shape. I have solved this error. I use GLM-10B to train a reward model. The 'mems' output is used as last_hidden_states, but 'mems' is detached, so it is cut out of the computation graph and gradients cannot flow back into the model. Therefore, verify your last_hidden_states and make sure it is still part of the computation graph.
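A minimal sketch of the failure mode described above and a quick check for it. The names mems and last_hidden_states follow the comment; this is not the GLM-10B code itself:

```python
import torch

x = torch.randn(2, 8, requires_grad=True)
hidden = x * 2.0              # stands in for the model's hidden states

detached = hidden.detach()    # detach() cuts the tensor out of the autograd graph,
                              # which is what happens to GLM's 'mems' per the comment above
print(detached.grad_fn)       # None -> no gradient can flow back through it
print(hidden.grad_fn)         # a grad_fn -> still connected to the graph

# Sanity check before training: the tensor fed into the reward head must still be
# part of the computation graph, otherwise the backbone never receives gradients.
last_hidden_states = hidden   # use the non-detached tensor
assert last_hidden_states.requires_grad and last_hidden_states.grad_fn is not None
```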

pvop · Jul 25 '23 04:07