DeepSpeed
[BUG] ZeRO Stage 2 - Access Item of Empty Dict during Initialization
Describe the bug
In zero/stage_1_and_2.py, initialize_gradient_partitioning_data_structures tries to access self.param_to_partition_ids at param_id=0, but the dictionary is empty.
This shows up with stage 2 and does not happen with stage 3. In addition, zero/stage3.py contains some legacy code that mirrors this part of stage_1_and_2.py but is never called.
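For reference, a minimal sketch of the failure mode, using illustrative names rather than the actual DeepSpeed data structures:

param_to_partition_ids = {0: {}}  # parameter group 0 exists, but no param ids were registered under it
group_id, param_id, partition_id = 0, 0, 0
# Same access pattern as get_first_param_index in stage_1_and_2.py; raises KeyError: 0
partition_id in param_to_partition_ids[group_id][param_id]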
To Reproduce
PyTorch v1.13.1, Transformers v4.21.0, DeepSpeed v0.8.3 (can reproduce with any version)
Run the following script on >= 2 nodes with 8 GPUs per node (8x A100-40GB on 16 nodes in my case, but the node count does not matter for this issue).
import os

import deepspeed
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    GPT2Config,
    GPT2LMHeadModel,
)

deepspeed.init_distributed(dist_backend="nccl")

model_config = GPT2Config(
    vocab_size=50257,
    n_positions=4096,
    n_embd=6144,
    n_layer=32,
    n_head=48,
    n_inner=None,
    activation_function="gelu_new",
    layer_norm_epsilon=1e-05,
    summary_type="cls_index",
    summary_use_proj=True,
    summary_activation=None,
    summary_proj_to_labels=True,
    gradient_checkpointing=True,
    use_cache=False,
    bos_token_id=50256,
    eos_token_id=50256,
    return_dict=True,
)

# Model is built under zero.Init even though the config below sets ZeRO stage 2.
with deepspeed.zero.Init(dtype=torch.bfloat16):
    model = AutoModelForCausalLM.from_config(model_config)

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=None,
    model_parameters=model.parameters(),
    config=os.environ["DS_CONFIG_PATH"],  # os.environ is a mapping, not a callable
)
ds_config.json:
{
  "fp16": {
    "enabled": false
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 12600000,
    "allgather_partitions": true,
    "allgather_bucket_size": 1260000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001,
      "weight_decay": 0.05,
      "bias_correction": true,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-8
    }
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "gradient_accumulation_steps": 1
}
Expected behavior
The deepest frame of the error traceback looks like the following block on each rank:
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 687, in get_first_param_index
if partition_id in self.param_to_partition_ids[group_id][param_id]:
KeyError: 0
System info (please complete the following information):
- Ubuntu 20.04
- 2 nodes with 8x 40GB A100s each
- PyTorch v1.13.1
- Transformers v4.21.0
- DeepSpeed v0.8.3 (can reproduce with any version)
Launcher context
${SLURM_HOSTFILE} srun -N 1 docker exec ${CONTAINER_NAME} mpirun -N 8 --hostfile ${MPI_HOSTFILE} ...
Docker context
Nothing special, but you may refer to the pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker image here. Transformers and DeepSpeed are installed from source.
Any update on when this will be resolved?
I looked into several source files and realized that deepspeed.zero.Init is only usable with ZeRO-3. For ZeRO-1 and ZeRO-2, we may have to remove the with block or guard it, e.g. with deepspeed.zero.Init(enabled=is_zero3). However, model initialization is then likely to be very slow when using ZeRO-2 + HF.
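A minimal sketch of that workaround, assuming the ZeRO stage is read from the same DeepSpeed config file; the is_zero3 logic here is illustrative and not part of the issue:

import json
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Illustrative: read the configured ZeRO stage from the DeepSpeed config file.
with open(os.environ["DS_CONFIG_PATH"]) as f:
    ds_config = json.load(f)
is_zero3 = ds_config.get("zero_optimization", {}).get("stage", 0) == 3

# Wrap model construction in zero.Init only for ZeRO-3;
# with enabled=False the context manager is a no-op.
with deepspeed.zero.Init(enabled=is_zero3, dtype=torch.bfloat16):
    model = AutoModelForCausalLM.from_config(model_config)  # model_config as in the repro script above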
We acknowledge your solution is correct: zero.Init is only needed for ZeRO stage 3. It can be disabled by setting the enabled flag to False, which makes it a no-op and should not slow down model initialization.
As for unused legacy code, please provide references.