DeepSpeed
[BUG] ZeRO Stage 2 - Access Item of Empty Dict during Initialization
Describe the bug
In zero/stage_1_and_2.py, initialize_gradient_partitioning_data_structures tries to access self.param_to_partition_ids at param_id=0, but the dictionary is empty.
This shows up with stage 2 and does not happen with stage 3. In addition, zero/stage3.py contains some legacy code that mirrors this part of stage_1_and_2.py but is never called.
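For reference, a minimal sketch of the failure mode, using illustrative names rather than the actual DeepSpeed data structures:

param_to_partition_ids = {0: {}}  # parameter group 0 exists, but no param ids were registered under it
group_id, param_id, partition_id = 0, 0, 0
# Same access pattern as get_first_param_index in stage_1_and_2.py; raises KeyError: 0
partition_id in param_to_partition_ids[group_id][param_id]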
To Reproduce
PyTorch v1.13.1, Transformers v4.21.0, DeepSpeed v0.8.3 (can reproduce with any version)
Run the following script on >= 2 nodes with 8 GPUs per node (8x A100-40GB on 16 nodes in my case, but the node count does not matter for this issue).
import os

import deepspeed
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    GPT2Config,
    GPT2LMHeadModel,
)

deepspeed.init_distributed(dist_backend="nccl")

model_config = GPT2Config(
    vocab_size=50257,
    n_positions=4096,
    n_embd=6144,
    n_layer=32,
    n_head=48,
    n_inner=None,
    activation_function="gelu_new",
    layer_norm_epsilon=1e-05,
    summary_type="cls_index",
    summary_use_proj=True,
    summary_activation=None,
    summary_proj_to_labels=True,
    gradient_checkpointing=True,
    use_cache=False,
    bos_token_id=50256,
    eos_token_id=50256,
    return_dict=True,
)

# Model is built under zero.Init even though the config below sets ZeRO stage 2.
with deepspeed.zero.Init(dtype=torch.bfloat16):
    model = AutoModelForCausalLM.from_config(model_config)

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=None,
    model_parameters=model.parameters(),
    config=os.environ["DS_CONFIG_PATH"],  # os.environ is a mapping, not a callable
)
ds_config.json:
{
  "fp16": {
    "enabled": false
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 12600000,
    "allgather_partitions": true,
    "allgather_bucket_size": 1260000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001,
      "weight_decay": 0.05,
      "bias_correction": true,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-8
    }
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0,
  "gradient_accumulation_steps": 1
}
Expected behavior
The deepest frame of the error traceback looks like the following block on each rank:
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 687, in get_first_param_index
if partition_id in self.param_to_partition_ids[group_id][param_id]:
KeyError: 0
System info (please complete the following information):
- Ubuntu 20.04
- 2 nodes with 8x 40GB A100s each
- PyTorch v1.13.1
- Transformers v4.21.0
- DeepSpeed v0.8.3 (can reproduce with any version)
Launcher context
${SLURM_HOSTFILE} srun -N 1 docker exec ${CONTAINER_NAME} mpirun -N 8 --hostfile ${MPI_HOSTFILE} ...
Docker context
Nothing special, but you may refer to the pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker image here. Transformers and DeepSpeed are installed from source.
Any update on when this will be resolved?
I looked into several source files and realized that deepspeed.zero.Init is only usable with ZeRO-3. For ZeRO-1 and ZeRO-2, we may have to remove the with block or guard it, e.g. with deepspeed.zero.Init(enabled=is_zero3). However, model initialization is then likely to be very slow when using ZeRO-2 + HF.
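A minimal sketch of that workaround, assuming the ZeRO stage is read from the same DeepSpeed config file; the is_zero3 logic here is illustrative and not part of the issue:

import json
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Illustrative: read the configured ZeRO stage from the DeepSpeed config file.
with open(os.environ["DS_CONFIG_PATH"]) as f:
    ds_config = json.load(f)
is_zero3 = ds_config.get("zero_optimization", {}).get("stage", 0) == 3

# Wrap model construction in zero.Init only for ZeRO-3;
# with enabled=False the context manager is a no-op.
with deepspeed.zero.Init(enabled=is_zero3, dtype=torch.bfloat16):
    model = AutoModelForCausalLM.from_config(model_config)  # model_config as in the repro script above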
We acknowledge your solution is correct: zero.Init is only needed for ZeRO stage 3. It can be disabled by setting the enabled flag to False, which makes it a no-op and should not slow down model initialization.
As for unused legacy code, please provide references.