[BUG]: Maybe it is a bug in LowLevelZeroOptimizer
🐛 Describe the bug
I am not sure whether the following code https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/zero/sharded_optim/low_level_optim.py#L484-L488 should be changed to the version below:
    for group_id in range(self.num_param_groups):
        for rank in range(self._world_size):
            fp16_param = self._param_store.get_flat_fp16_param_by_rank_group(rank=rank, group_id=group_id)
            # added: map the group-local rank to its global rank so this also works with model parallelism
            global_rank = gpc.get_ranks_in_group(self._dp_parallel_mode)[rank]
            handle = dist.broadcast(fp16_param, src=global_rank, group=self._dp_group, async_op=True)
            handles.append(handle)
This change is needed because src in dist.broadcast should be a global rank, not a rank within the group. But I am not sure whether this change is the right fix.
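For context, torch.distributed.broadcast interprets src as a rank in the global world even when a group is passed, which is why the group-local index has to be mapped to a global rank. Below is a minimal, self-contained sketch of my own (not ColossalAI code) that shows this on a 4-process gloo world where only global ranks 2 and 3 form the data-parallel group; the dp_ranks list plays the role of gpc.get_ranks_in_group(...).

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(global_rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29501"
        dist.init_process_group("gloo", rank=global_rank, world_size=world_size)

        # Pretend data parallelism only spans global ranks 2 and 3
        # (e.g. the other dimension of a 2x2 grid is tensor parallelism).
        dp_ranks = [2, 3]                      # stands in for gpc.get_ranks_in_group(...)
        dp_group = dist.new_group(ranks=dp_ranks)

        if global_rank in dp_ranks:
            t = torch.full((1,), float(global_rank))
            local_src = 0                      # "rank 0 of the dp group"
            global_src = dp_ranks[local_src]   # the mapping the proposed fix performs
            # src must be the global rank (2), not the group-local index (0)
            dist.broadcast(t, src=global_src, group=dp_group)
            assert t.item() == 2.0

        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(4,), nprocs=4)

Passing the group-local index 0 as src here would not refer to the intended process, which is the mismatch described above.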
Another potential issue is the norm calculation in https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/zero/sharded_optim/low_level_optim.py#L443-L448, which uses the function in https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/zero/sharded_optim/_utils.py#L223-L227. For model parallelism, the version in https://github.com/hpcaitech/ColossalAI/blob/48d33b1b1753f19361e7e54a68a7ac5999dc02e4/colossalai/utils/common.py#L344-L365 may be the correct one, since it also accounts for tensors that are sharded by model parallelism.
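To make the concern concrete, here is a minimal sketch of my own (not the ColossalAI implementation) of the L2-norm pattern at stake: the locally accumulated squared norm has to be reduced over the tensor-parallel group as well as the ZeRO data-parallel group, otherwise each tensor-parallel rank only sees the norm of its own shard. The dp_group/tp_group arguments are placeholders for whatever process groups the optimizer holds; handling of the inf-norm and of replicated parameters is omitted.

    import torch
    import torch.distributed as dist

    def compute_grad_norm(grads, dp_group=None, tp_group=None):
        # Squared L2 norm of the gradient shards held by this rank.
        device = grads[0].device if grads else torch.device("cpu")
        total_sq = torch.zeros(1, dtype=torch.float32, device=device)
        for g in grads:
            total_sq += g.detach().float().pow(2).sum()

        # ZeRO shards the flat gradients across the data-parallel group ...
        if dp_group is not None:
            dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=dp_group)
        # ... and tensor parallelism additionally shards each parameter, so a
        # second reduction over the tensor-parallel group is needed for a
        # global norm (the step the reporter believes is missing here).
        if tp_group is not None:
            dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=tp_group)

        return total_sq.sqrt().item()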
Would you please check these? Thanks~
Environment
No response
@yhcc Thanks for your help. We have not tested hybrid parallelism with low level zero and tensor parallelism yet. Could you please run some tests and open a pull request if possible?
I will try to provide a solution, but I am not familiar with every part of ColossalAI, so there are probably mistakes in my fix. I am still working on it; once it is done I will open a PR, which I hope can serve as a reference.
While repairing this bug, I found what may be another bug: https://github.com/hpcaitech/ColossalAI/blob/4898ff8af45a013f13e5fdadf9b240b2d240b3ca/colossalai/utils/common.py#L125-L126 I think this function does not properly handle a ColoTensor that is actually a model-parallel (sharded) tensor, although I am not sure whether this has any bad effects. As far as I can tell, the following code may be affected: https://github.com/hpcaitech/ColossalAI/blob/4898ff8af45a013f13e5fdadf9b240b2d240b3ca/colossalai/utils/common.py#L290
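To illustrate the concern, here is a hedged sketch of my own (not ColossalAI code; is_tensor_parallel is a hypothetical predicate standing in for whatever check ColoTensor provides): when accumulating a gradient norm, a parameter replicated across tensor-parallel ranks should be counted only once per tensor-parallel group, while a sharded ColoTensor contributes one shard per rank and must then be reduced over the group.

    import torch
    import torch.distributed as dist

    def accumulate_sq_grad_norm(params, tp_group, tp_rank, is_tensor_parallel):
        # Accumulate on CPU for simplicity; a real implementation would stay on device.
        sq = torch.zeros(1, dtype=torch.float32)
        for p in params:
            if p.grad is None:
                continue
            g2 = p.grad.detach().float().pow(2).sum().cpu()
            if is_tensor_parallel(p):
                # Sharded parameter: every tensor-parallel rank holds a distinct
                # shard, so every rank contributes its piece.
                sq += g2
            elif tp_rank == 0:
                # Replicated parameter: count it on one tensor-parallel rank only,
                # otherwise the norm is inflated by the tensor-parallel size.
                sq += g2
        # Combine the shard contributions (and the single replicated contribution).
        dist.all_reduce(sq, op=dist.ReduceOp.SUM, group=tp_group)
        return sq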
I have put a demo implementation at https://github.com/yhcc/ColossalAI/blob/main/colossalai/zero/sharded_optim/low_level_optim.py. To make it work with the other parts of ColossalAI, further modifications are also needed, such as not using the DDPGradientHandler to sync gradients.
We have updated the codebase a lot since then. This issue was closed due to inactivity. Thanks.