
ZeRO 2 gets stuck when initializing optimizer states

tengxiaoliu opened this issue 5 months ago

I am using ZeRO stage 2 for training. The process gets stuck while initializing the optimizer states. I am able to run the same setup with tensor parallelism (tp) instead.

Here is the package info:

python==3.10.13
deepspeed==0.12.6
transformers==4.30.2
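
These can be double-checked from inside the training environment (a trivial sanity check, not part of the original run):

    import sys
    import deepspeed
    import transformers

    print(sys.version.split()[0])      # expect 3.10.13
    print(deepspeed.__version__)       # expect 0.12.6
    print(transformers.__version__)    # expect 4.30.2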

CoLLiE config:

    from collie import CollieConfig  # top-level export, as used in the CoLLiE examples

    config = CollieConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
    config.dp_size = 1  # no pipeline or tensor parallelism; relying on ZeRO stage 2 below
    config.pp_size = 1
    config.tp_size = 1
    config.train_epochs = args.train_epochs
    config.train_micro_batch_size = 1
    config.gradient_accumulation_steps = 2
    config.eval_batch_size = 1
    config.eval_per_n_epochs = 1
    config.use_flash = False
    config.ds_config = {
        "fp16": {
            "enabled": True
        },
        "monitor_config": {
            "enabled": True,
            "tag": f"{args.tag}_ep{args.train_epochs}",
            "wandb": {
                "enabled": False,
            }
        },
        "zero_optimization": {
            "stage": 2,
        },
    }
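
For reference, the log below shows ZeRO falling back to its default 500 MB reduce/allgather buckets. Since the hang happens right at "Before initializing optimizer states", one experiment worth trying (a sketch using the stock DeepSpeed `zero_optimization` keys, not a confirmed fix) is to shrink those buckets:

    # Hedged experiment: smaller buckets reduce the transient allocations made
    # while ZeRO 2 flattens parameters and builds the optimizer states.
    config.ds_config["zero_optimization"] = {
        "stage": 2,
        "reduce_bucket_size": 5e7,     # default 5e8, the 500,000,000 in the log
        "allgather_bucket_size": 5e7,  # default 5e8
    }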

The process gets stuck in the following state:

[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-13 09:24:02,418] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-01-13 09:24:02,418] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-01-13 09:24:02,418] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-01-13 09:24:02,418] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-01-13 09:24:02,418] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-01-13 09:24:02,419] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-01-13 09:24:02,419] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-01-13 09:24:28,621] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-01-13 09:24:28,622] [INFO] [utils.py:792:see_memory_usage] MA 15.69 GB         Max_MA 17.26 GB         CA 17.26 GB         Max_CA 17 GB 
[2024-01-13 09:24:28,622] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory:  used = 23.05 GB, percent = 3.1%
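
When it stalls like this, it is hard to tell which rank is blocked in which call. A minimal stack-dump watchdog (my own diagnostic sketch, standard library only) can be dropped into the training script to show where each rank is stuck:

    import faulthandler
    import sys

    # If anything below takes longer than 5 minutes, print a traceback for
    # every thread on this rank to stderr, and keep printing every 5 minutes.
    faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

    # ... construct the Trainer and start training here ...

    # Cancel the watchdog once initialization has completed.
    faulthandler.cancel_dump_traceback_later()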

tengxiaoliu · Jan 13 '24