DeepSpeed
High Peak GPU Memory with ZeRO Stage 3
Describe the bug
While comparing ZeRO Stage 2 and ZeRO Stage 3, I found that the peak GPU memory utilization (as measured by the deepspeed.runtime.utils.memory_status function) is higher when using ZeRO Stage 3, which does not align with what is described in the paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.
I noticed this behavior at first while testing with small models on small hardware (e.g. 560M parameter model on 4x16GB T4 GPUs), and then observed it again after scaling up the model size and environment (13B parameter model on 8x80GB A100 GPUs).
I would like to understand where this memory overhead comes from with ZeRO Stage 3, and to figure out whether this is a bug, whether something is missing in my config, or whether my test scenario or environment is simply not appropriate for ZeRO Stage 3.
Example Memory Stats Captured w/ ZeRO Stage 2
RANK=2 MEMSTATS Memory stats after training step device=cuda:2 current alloc=45.4646GB (delta=0.0000GB max=52.3739GB) current cache=57.1406GB (delta=0.0000GB max=57.1406GB)
Example Memory Stats Captured w/ ZeRO Stage 3
RANK=1 MEMSTATS Memory stats after training step device=cuda:1 current alloc=25.3114GB (delta=0.0000GB max=56.1189GB) current cache=69.2051GB (delta=0.0000GB max=69.2051GB)
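For what it's worth, the same peaks can be cross-checked directly against the PyTorch allocator's own counters. A minimal sketch (the placement of the reset/read calls around a single training step is my own, not part of the script below):

import os
import torch

device = torch.device("cuda", int(os.getenv("LOCAL_RANK", "0")))

torch.cuda.reset_peak_memory_stats(device)
# ... run one forward/backward/step here ...
peak_alloc_gb = torch.cuda.max_memory_allocated(device) / 2**30
peak_reserved_gb = torch.cuda.max_memory_reserved(device) / 2**30
print(f"peak allocated: {peak_alloc_gb:.2f} GB, peak reserved (cache): {peak_reserved_gb:.2f} GB")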
To Reproduce
Here is a minimal training script to reproduce the behavior:
import deepspeed
import os
import torch
from deepspeed.runtime.utils import memory_status
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from util import load_wikitext


def main():
    rank = int(os.getenv("LOCAL_RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    model_name = "facebook/opt-13b"
    model = AutoModelForCausalLM.from_pretrained(model_name)

    deepspeed_config = {
        "train_micro_batch_size_per_gpu": 1,
        "optimizer": {
            "type": "Adam",
            "params": { "lr": 5e-5 }
        },
        "fp16": { "enabled": True },
        "zero_optimization": { "stage": 3 }
    }

    model_engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=deepspeed_config
    )
    model_engine.train()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    train_dataset = load_wikitext(tokenizer, collator, max_length=512).select(range(128))
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=1,
        shuffle=False,
        sampler=DistributedSampler(train_dataset, num_replicas=world_size)
    )

    device = torch.device("cuda", rank)
    for batch in train_dataloader:
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)

        outputs = model_engine(input_ids, labels=labels)
        model_engine.backward(outputs.loss)
        model_engine.step()

        memory_status("Memory stats after training step")

    return


if __name__ == "__main__":
    main()
Here is the load_wikitext() function used to prepare the training dataset (note: token masking functionality is not used in the above example):
import datasets


def load_wikitext(tokenizer, collator, max_length=None):
    def mask_tokens(x):
        input_ids, labels = collator.torch_mask_tokens(x['input_ids'], special_tokens_mask=x['special_tokens_mask'])
        return {
            "input_ids": input_ids,
            "labels": labels
        }

    wikitext = datasets.load_dataset("wikitext", "wikitext-2-v1")
    train_dataset = wikitext["train"]
    train_dataset = train_dataset.map(lambda x: tokenizer(x["text"], max_length=max_length, padding='max_length', truncation=True, return_tensors='pt', return_special_tokens_mask=True), batched=True)
    train_dataset.set_format(type="torch", columns=["input_ids", "special_tokens_mask"])

    if collator.mlm:
        train_dataset = train_dataset.map(mask_tokens, remove_columns=['special_tokens_mask'])
    else:
        train_dataset = train_dataset.map(lambda x: {
            "input_ids": x["input_ids"],
            "labels": x["input_ids"]
        })

    return train_dataset
Expected behavior
The ZeRO paper describes the memory consumed by Stages 1, 2, and 3 using the equations below.
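Reproducing them here for reference, as I read the paper (with Ψ = number of model parameters, N_d = data-parallel degree, and K = 12 for mixed-precision Adam):

Stage 1 (P_os):     2Ψ + 2Ψ + KΨ/N_d
Stage 2 (P_os+g):   2Ψ + (2 + K)Ψ/N_d
Stage 3 (P_os+g+p): (2 + 2 + K)Ψ/N_d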
Based on these equations, I expected the peak GPU memory consumption to be significantly lower when using ZeRO Stage 3 than with ZeRO Stage 2.
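As a rough sanity check of that expectation (assuming Ψ = 13B, N_d = 8, K = 12, and ignoring activations and buffers), the model-state memory per GPU would be approximately:

Stage 2: 2(13) + (2 + 12)(13)/8 ≈ 26 + 22.75 ≈ 48.8 GB
Stage 3: (2 + 2 + 12)(13)/8 = 26 GB

The steady-state "current alloc" values above (≈45 GB for Stage 2 vs ≈25 GB for Stage 3) are roughly in that ballpark; it is the peak ("max") values that go the other way.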
ds_report output
[2023-06-11 22:03:53,488] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/lib/python3/dist-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.4, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.8
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 8x80GB A100
- Interconnects (if applicable): N/A
- Python version: 3.8.10
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? I am using the deepspeed launcher.
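For reference, the launch command looks like the following (train.py is a placeholder name for the script above):

deepspeed --num_gpus=8 train.py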
Docker context Are you using a specific docker image that you can share? N/A
Additional context N/A
@gnovack Thanks for the script. Can you include the load_wikitext for repro?
Sure, just added this function to the issue description
Hi all!
I'm experiencing the same thing when trying to load dolly-12B onto 4x A100 40GB for ZeRO-3 training. I'm using the transformers Trainer, and each process seems to try to load the whole model onto its GPU, which fails (full precision, so 12B params × 4 bytes ≈ 48 GB > 40 GB). The model is first loaded normally on CPU by each of the 4 processes; then I call the Trainer, which creates the DeepSpeed engine.
I think it is related to this line: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L1048, which was recently changed to add the is_zero3_model check, but even with the latest version of DeepSpeed it still fails.
Is there a way to ensure that, when calling deepspeed.initialize(), the model (which initially lives on CPU in each of the 4 processes) is loaded directly in sharded form rather than being placed whole on each GPU first?
@jschweiz Yours appears to be a separate issue. Take a look at this response and see if it helps. If not feel free to open an issue with code to reproduce and assign me.
@gnovack I have investigated the issue; here are a few points to consider:
- If you follow the instructions in my previous comment, they show the proper way to use HfDeepSpeedConfig for ZeRO Stage 3 (see the sketch after this list). By doing this I was able to get the MaxMem of Stage 3 equal to that of Stage 2.
- For further reductions in memory usage you can adjust the intermediate communication buffers used by ZeRO Stage 3, such as the prefetch buffer size (also shown in the sketch below). Those settings can be found here
- The best place to measure the difference between Stage 2 and Stage 3 memory usage is right after DeepSpeed initialization. Once in the training loop, a lot of the transient allocated memory is up to the PyTorch memory allocator.
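A rough sketch of the points above (not the exact snippet from the earlier comment; the stage3_* buffer values are illustrative, not recommendations, and ds_config keys otherwise mirror the config in the repro script):

import deepspeed
from deepspeed.runtime.utils import memory_status
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig  # transformers.integrations in newer releases

# ZeRO-3 config; the stage3_* buffer sizes trade throughput against memory
# (smaller values generally mean less transient GPU memory).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 5e-5}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "reduce_bucket_size": 5e7
    }
}

# Instantiate HfDeepSpeedConfig BEFORE from_pretrained and keep the reference alive:
# its presence is what lets from_pretrained load the weights directly into their
# ZeRO-3 partitions instead of materializing the full model on every rank.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-13b")

model_engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

# Per the last point above: measure here, right after initialization, for the
# cleanest Stage 2 vs Stage 3 comparison.
memory_status("Memory stats after deepspeed.initialize")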