
[BUG]: ZeroOptimizer in pipeline gets stuck when only several layers have parameters to be optimized

Open · zeyugao opened this issue 1 year ago · 1 comment

🐛 Describe the bug

I am using this configuration as an example:

        plugin = HybridParallelPlugin(
            tp_size=2,
            pp_size=2,
            zero_stage=1,
            microbatch_size=1,
            num_microbatches=None,
            enable_jit_fused=False,
            enable_fused_normalization=True,
            enable_flash_attention=True,
            precision=mixed_precision,
            initial_scale=1,
        )
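For reference, this is roughly how the plugin is wired into training. This is a minimal sketch of the standard Booster pattern rather than code copied from my script; the model/optimizer/criterion/dataloader names are placeholders:

    # Hedged sketch: boost the model and optimizer with the plugin above.
    from colossalai.booster import Booster

    booster = Booster(plugin=plugin)
    model, optimizer, criterion, dataloader, lr_scheduler = booster.boost(
        model, optimizer, criterion, dataloader, lr_scheduler
    )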

The parameters that need to be optimized live only in the first stage: for example, I freeze all parameters in llama and only have a projector to be optimized, roughly as sketched below.
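A minimal sketch of that setup (the module names `llama` and `projector` are illustrative, not the actual model definition):

    import torch

    # Freeze the backbone everywhere; only the projector stays trainable,
    # and after pipeline partitioning it ends up on the first stage only.
    for param in model.llama.parameters():
        param.requires_grad = False
    for param in model.projector.parameters():
        param.requires_grad = True

    # The optimizer therefore holds parameters only on the first stage;
    # later stages see an (effectively) empty parameter group.
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4
    )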

However, it takes a few extra hot patches before the run even reaches the ZeroOptimizer problem described in the title.

First, I add a check for whether group_params is empty in https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L163 (sketched below),
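A rough sketch of that hot patch; the surrounding loop is paraphrased from memory, not copied from low_level_optim.py:

    # Hedged sketch: skip parameter groups that are empty on this stage,
    # so no per-group bookkeeping is built for them.
    for group_id, param_group in enumerate(self.optim.param_groups):
        group_params = param_group["params"]
        if len(group_params) == 0:
            continue  # this stage owns no trainable params in this group
        # ... original per-group sharding / bookkeeping continues here ...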

With that check in place, https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L215-L216 no longer returns 1, and no error is thrown at https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L589, where it would otherwise try to access the parameters of group 0, which is empty.

Then, when we call optimizer.step, at https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/zero/low_level/low_level_optim.py#L590

it will call

https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/booster/plugin/hybrid_parallel_plugin.py#L774

since we use HybridParallelPlugin with zero_stage=1.

Then in https://github.com/hpcaitech/ColossalAI/blob/1a3315e33611a63c5aa2e2d507f1d51c8be0c9d2/colossalai/booster/plugin/hybrid_parallel_plugin.py#L846-L848

it tries to sync total_norm_exponentiated_cuda along the pp axis. However, because num_param_groups is 0 on some stages, those stages never call _compute_grad_norm, so they never join the sync with the other stages and the pipeline gets stuck. The sketch below illustrates the failure mode.
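A hedged sketch of why this hangs and one possible fix direction; this is not the actual ColossalAI code, just the collective-communication pattern involved. The all_reduce along the pp axis is a collective, so every pipeline stage has to enter it, even a stage whose num_param_groups is 0:

    import torch
    import torch.distributed as dist

    def sync_grad_norm(local_norm_exponentiated: float, pp_group) -> torch.Tensor:
        # Stages without trainable params should still call this with 0.0;
        # if they skip the all_reduce, the stages that do call it block forever.
        total = torch.tensor(local_norm_exponentiated, device="cuda")
        dist.all_reduce(total, op=dist.ReduceOp.SUM, group=pp_group)
        return total

In other words, the fix likely needs stages with empty parameter groups to still participate in the norm all-reduce (contributing 0.0), rather than skipping _compute_grad_norm entirely.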

Environment

PyTorch 2.1.0 with CUDA 11.8; ColossalAI on the master branch

zeyugao · Nov 04 '23 14:11

Thank you, we will fix it soon.

flybird11111 · Nov 21 '23 09:11