[BUG]: ZeroOptimizer in pipeline gets stuck when only several layers have parameters to be optimized
🐛 Describe the bug
I am using this configuration as an example:
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
tp_size=2,
pp_size=2,
zero_stage=1,
microbatch_size=1,
num_microbatches=None,
enable_jit_fused=False,
enable_fused_normalization=True,
enable_flash_attention=True,
precision=mixed_precision,  # e.g. 'fp16' or 'bf16'
initial_scale=1,
)
The parameters to be optimized live only in the first stage: for example, I freeze all parameters in LLaMA, and only a projector needs to be optimized.
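For context, here is a minimal sketch of that setup. The module name ProjectorThenFrozenLlama, the dimensions, and the learning rate are placeholders, not my actual code:

import torch
import torch.nn as nn
from colossalai.booster import Booster

# Everything in LLaMA is frozen and only a small projector in front of it is
# trainable, so after the pipeline split only the first stage holds
# parameters that require gradients.
class ProjectorThenFrozenLlama(nn.Module):
    def __init__(self, llama: nn.Module, in_dim: int, hidden_size: int):
        super().__init__()
        self.projector = nn.Linear(in_dim, hidden_size)  # trainable
        self.llama = llama
        for p in self.llama.parameters():
            p.requires_grad = False  # frozen

    def forward(self, inputs_embeds, **kwargs):
        return self.llama(inputs_embeds=self.projector(inputs_embeds), **kwargs)

model = ProjectorThenFrozenLlama(llama, in_dim=1024, hidden_size=4096)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)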
However, it takes some extra hot patches before this setup even reaches the ZeRO optimizer problem.
First, I add a check for whether group_params is empty in https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L163,
so that https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L215-L216 will not return 1, and https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L589 will not throw an error by trying to access the parameters in group 0, which is empty.
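For reference, the shape of that hot patch is roughly the following. This is a paraphrased sketch of the group-collection logic, not the exact upstream code, and collect_working_param_groups is just an illustrative name:

from typing import Dict, List

import torch

def collect_working_param_groups(optim: torch.optim.Optimizer) -> Dict[int, List[torch.nn.Parameter]]:
    # Keep only the trainable parameters of each group, and skip groups that
    # end up empty so that later code never indexes into an empty group 0.
    working_groups: Dict[int, List[torch.nn.Parameter]] = {}
    for group_id, param_group in enumerate(optim.param_groups):
        group_params = [p for p in param_group["params"] if p.requires_grad]
        if len(group_params) == 0:
            continue  # this stage has nothing to optimize in this group
        working_groups[group_id] = group_params
    return working_groups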
Then, when we call optimizer.step, https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L590
calls https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/booster/plugin/hybrid_parallel_plugin.py#L774,
since we use HybridParallelPlugin with zero_stage=1.
Then, in https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/booster/plugin/hybrid_parallel_plugin.py#L846-L848,
it tries to sync total_norm_exponentiated_cuda along the pp axis. However, because num_param_groups is 0 on some stages, those stages never call _compute_grad_norm, so they never join the sync and the other stages get stuck waiting.
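In other words, the stages that do enter the all-reduce along the pp axis wait forever for the stages that skipped it. One possible fix, sketched below, is to have every stage join the collective even when it owns no trainable parameters, contributing a zero norm. The function compute_grad_norm_sketch and its arguments are illustrative, not the actual ColossalAI implementation:

import torch
import torch.distributed as dist

def compute_grad_norm_sketch(gradients, pp_group, device) -> float:
    # Local sum of squared L2 norms; this is simply 0.0 on a stage that owns
    # no trainable parameters, instead of that stage skipping the collective.
    local_sq_sum = sum(float(g.float().norm(2) ** 2) for g in gradients)
    total = torch.tensor([local_sq_sum], device=device, dtype=torch.float32)
    # Every pipeline stage must reach this call; if some stages skip it, the
    # remaining stages block here forever, which is the hang reported above.
    dist.all_reduce(total, op=dist.ReduceOp.SUM, group=pp_group)
    return float(total.item()) ** 0.5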
Environment
PyTorch 2.1.0 with CUDA 11.8; ColossalAI on the master branch.
Thank you, we will fix it soon.