[BUG]: Maybe it is a bug in LowLevelZeroOptimizer
🐛 Describe the bug
I am not sure whether the following code https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/zero/sharded_optim/low_level_optim.py#L484-L488 should be changed to the version below:
    for group_id in range(self.num_param_groups):
        for rank in range(self._world_size):
            fp16_param = self._param_store.get_flat_fp16_param_by_rank_group(rank=rank, group_id=group_id)
            # added: map the group-local rank to its global rank so this also works with model parallelism
            global_rank = gpc.get_ranks_in_group(self._dp_parallel_mode)[rank]
            handle = dist.broadcast(fp16_param, src=global_rank, group=self._dp_group, async_op=True)
            handles.append(handle)
This change is needed because src in dist.broadcast should be a global rank, not a rank within the group. But I am not sure whether this change is the right fix.
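For context, torch.distributed.broadcast interprets src as a rank in the global world even when a group is passed, which is why the group-local index has to be mapped to a global rank. Below is a minimal, self-contained sketch of my own (not ColossalAI code) that shows this on a 4-process gloo world where only global ranks 2 and 3 form the data-parallel group; the dp_ranks list plays the role of gpc.get_ranks_in_group(...).

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(global_rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29501"
        dist.init_process_group("gloo", rank=global_rank, world_size=world_size)

        # Pretend data parallelism only spans global ranks 2 and 3
        # (e.g. the other dimension of a 2x2 grid is tensor parallelism).
        dp_ranks = [2, 3]                      # stands in for gpc.get_ranks_in_group(...)
        dp_group = dist.new_group(ranks=dp_ranks)

        if global_rank in dp_ranks:
            t = torch.full((1,), float(global_rank))
            local_src = 0                      # "rank 0 of the dp group"
            global_src = dp_ranks[local_src]   # the mapping the proposed fix performs
            # src must be the global rank (2), not the group-local index (0)
            dist.broadcast(t, src=global_src, group=dp_group)
            assert t.item() == 2.0

        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(4,), nprocs=4)

Passing the group-local index 0 as src here would not refer to the intended process, which is the mismatch described above.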
Another potential issue is the norm calculation in https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/zero/sharded_optim/low_level_optim.py#L443-L448, which uses the function in https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/zero/sharded_optim/_utils.py#L223-L227. For model parallelism, the version in https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/utils/common.py#L344-L365 may be the correct one, since it also accounts for tensors that are sharded by model parallelism.
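To make the concern concrete, here is a minimal sketch of my own (not the ColossalAI implementation) of the L2-norm pattern at stake: the locally accumulated squared norm has to be reduced over the tensor-parallel group as well as the ZeRO data-parallel group, otherwise each tensor-parallel rank only sees the norm of its own shard. The dp_group/tp_group arguments are placeholders for whatever process groups the optimizer holds; handling of the inf-norm and of replicated parameters is omitted.

    import torch
    import torch.distributed as dist

    def compute_grad_norm(grads, dp_group=None, tp_group=None):
        # Squared L2 norm of the gradient shards held by this rank.
        device = grads[0].device if grads else torch.device("cpu")
        total_sq = torch.zeros(1, dtype=torch.float32, device=device)
        for g in grads:
            total_sq += g.detach().float().pow(2).sum()

        # ZeRO shards the flat gradients across the data-parallel group ...
        if dp_group is not None:
            dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=dp_group)
        # ... and tensor parallelism additionally shards each parameter, so a
        # second reduction over the tensor-parallel group is needed for a
        # global norm (the step the reporter believes is missing here).
        if tp_group is not None:
            dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=tp_group)

        return total_sq.sqrt().item()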
Would you please check these? Thanks~
Environment
No response
@yhcc Thanks for your help. We have not tested hybrid parallelism with low level zero and tensor parallelism yet. Could you please run some tests and open a pull request if possible?
I will try to provide a solution, but I am not familiar with every part of ColossalAI, so there are probably mistakes in my fix. I am still working on it; once it is done I will open a PR, which I hope can serve as a reference.
While repairing this bug, I found what may be another bug: https://github.com/hpcaitech/ColossalAI/blob/4898ff8af45a013f13e5fdadf9b240b2d240b3ca/colossalai/utils/common.py#L125-L126 I think this function does not properly handle a ColoTensor that is actually a model-parallel (sharded) tensor, although I am not sure whether this has any bad effects. As far as I can tell, the following code may be affected: https://github.com/hpcaitech/ColossalAI/blob/4898ff8af45a013f13e5fdadf9b240b2d240b3ca/colossalai/utils/common.py#L290
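To illustrate the concern, here is a hedged sketch of my own (not ColossalAI code; is_tensor_parallel is a hypothetical predicate standing in for whatever check ColoTensor provides): when accumulating a gradient norm, a parameter replicated across tensor-parallel ranks should be counted only once per tensor-parallel group, while a sharded ColoTensor contributes one shard per rank and must then be reduced over the group.

    import torch
    import torch.distributed as dist

    def accumulate_sq_grad_norm(params, tp_group, tp_rank, is_tensor_parallel):
        # Accumulate on CPU for simplicity; a real implementation would stay on device.
        sq = torch.zeros(1, dtype=torch.float32)
        for p in params:
            if p.grad is None:
                continue
            g2 = p.grad.detach().float().pow(2).sum().cpu()
            if is_tensor_parallel(p):
                # Sharded parameter: every tensor-parallel rank holds a distinct
                # shard, so every rank contributes its piece.
                sq += g2
            elif tp_rank == 0:
                # Replicated parameter: count it on one tensor-parallel rank only,
                # otherwise the norm is inflated by the tensor-parallel size.
                sq += g2
        # Combine the shard contributions (and the single replicated contribution).
        dist.all_reduce(sq, op=dist.ReduceOp.SUM, group=tp_group)
        return sq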
I have put a demo implementation at https://github.com/yhcc/ColossalAI/blob/main/colossalai/zero/sharded_optim/low_level_optim.py. To make it work with the other parts of ColossalAI, further modifications are also needed, such as not using the DDPGradientHandler to sync gradients.
We have updated the codebase a lot since then. This issue was closed due to inactivity. Thanks.