CoLLiE
Zero 2 gets stuck when initializing optimizer states
I am using ZeRO stage 2 for training. The process gets stuck while initializing the optimizer states. The same script runs fine when I use tensor parallelism (tp) instead.
Package versions:
python==3.10.13
deepspeed==0.12.6
transformers==4.30.2
Collie config:
from collie import CollieConfig

config = CollieConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.dp_size = 1
config.pp_size = 1
config.tp_size = 1
config.train_epochs = args.train_epochs
config.train_micro_batch_size = 1
config.gradient_accumulation_steps = 2
config.eval_batch_size = 1
config.eval_per_n_epochs = 1
config.use_flash = False
config.ds_config = {
    "fp16": {
        "enabled": True,
    },
    "monitor_config": {
        "enabled": True,
        "tag": f"{args.tag}_ep{args.train_epochs}",
        "wandb": {
            "enabled": False,
        },
    },
    "zero_optimization": {
        "stage": 2,
    },
}
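One thing I have been meaning to try (this is my own guess, not something the CoLLiE docs recommend for this hang): the log below shows the default 500,000,000-element reduce/allgather buckets being set up right before the stall, so shrinking the ZeRO-2 communication buckets and disabling communication overlap might change the behavior. A minimal sketch of the adjusted `ds_config`:

```python
# Assumption: smaller buckets and no overlap_comm make the ZeRO-2
# initialization collectives cheaper and easier to debug. These are
# standard DeepSpeed zero_optimization keys, with values chosen for
# illustration only.
ds_config = {
    "fp16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 2,
        "reduce_bucket_size": 5e7,      # default is 5e8
        "allgather_bucket_size": 5e7,   # default is 5e8
        "overlap_comm": False,          # simpler communication schedule
    },
}
```

If the hang goes away with smaller buckets, that would point at the all-reduce/all-gather collectives rather than the optimizer-state allocation itself.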
The process gets stuck in the following state:
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-13 09:24:02,418] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-01-13 09:24:02,418] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-01-13 09:24:02,418] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-01-13 09:24:02,418] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-01-13 09:24:02,418] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-01-13 09:24:02,419] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-01-13 09:24:02,419] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-01-13 09:24:28,621] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-01-13 09:24:28,622] [INFO] [utils.py:792:see_memory_usage] MA 15.69 GB Max_MA 17.26 GB CA 17.26 GB Max_CA 17 GB
[2024-01-13 09:24:28,622] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 23.05 GB, percent = 3.1%
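To narrow down where each rank is blocked, I can add more diagnostics before the training script initializes CoLLiE/DeepSpeed. This is a generic debugging sketch, not a fix: `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are standard NCCL/PyTorch environment variables, and `faulthandler` is from the Python standard library.

```python
import faulthandler
import os
import sys

# Log NCCL collective activity so a stuck all-reduce/all-gather is visible.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Have torch.distributed report mismatched collectives across ranks.
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# Dump every thread's Python stack to stderr every 120 s; if the process
# hangs, the repeated dumps show exactly which call each rank is blocked in.
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)
```

Running this on each rank and comparing the stack dumps should show whether all ranks are waiting inside the same collective during optimizer-state initialization.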