[BUG] Multi-GPU performance worse than single GPU when using optimizers with moving averages (e.g.: Adam)
I will refer to the issue I opened on the accelerate GitHub, since I am unsure whether this is an accelerate, DeepSpeed, or torch issue; all of the details are there.
TL;DR: The performance when training a model with DeepSpeed is slightly worse in a multi-GPU setup compared to a single GPU, depending on the optimizer. This is true whether you use torch's or DeepSpeed's optimizers.
Here's what I observed when running accelerate with DeepSpeed, 1 vs 2 GPUs:
- On the first training step, everything is identical: model input, model weights, model output, loss, gradients, and model weights after the weight update.
- On the second training step, the model input, model weights, model output, loss, and gradients are still identical, but the model weights after the weight update differ between the 1 and 2 GPU setups.
- For the following steps, since the weights have diverged, everything else diverges as well, and the performance of the 2 GPU setup ends up worse than the 1 GPU setup (see the AdamW update sketch just after this list).
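For context on why the optimizer choice matters here: AdamW, as configured in the DeepSpeed configs further down, keeps per-parameter moving averages, so once a single update differs, that difference is also stored in the optimizer state and carried into every later step. Below is a minimal reference sketch of the textbook AdamW math (not DeepSpeed's actual implementation), with the hyperparameters from the configs as defaults:

```python
import torch

def adamw_step(param, grad, m, v, step, lr=1e-1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """Textbook AdamW update for one parameter tensor (reference only)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment moving average
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment moving average
    m_hat = m / (1 - beta1 ** step)             # bias correction (step starts at 1)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay (zero in the configs below) plus the adaptive step.
    param = param - lr * (m_hat / (v_hat.sqrt() + eps) + weight_decay * param)
    return param, m, v
```

Plain SGD without momentum carries no such state, which is consistent with the observation further down that switching the optimizer to SGD makes the 1 and 2 GPU runs identical.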
Here's how one can reproduce this. I am using Hugging Face accelerate, but using DeepSpeed directly should reproduce the same problem.
OS and devices: A100 GPUs, Ubuntu 22.04, CUDA 11.8
Packages:
- accelerate=0.25.0
- deepspeed=0.12.6
- python=3.10.13
- pytorch=2.1.2
Code to reproduce:
import argparse

import torch
import torch.nn as nn
from accelerate import Accelerator
from accelerate.utils import set_seed, DummyOptim


class Model(nn.Module):
    """Single linear unit: one input feature, one output logit."""

    def __init__(self):
        super().__init__()
        self.w = nn.Linear(1, 1)

    def forward(self, x):
        return self.w(x)


class Dataset(torch.utils.data.IterableDataset):
    """Toy stream: x ~ U[0, 2), label is 1 iff x >= 1."""

    def __iter__(self):
        for _ in range(1000000000):
            x = torch.rand(1) * 2
            y = torch.zeros(1) if x < 1 else torch.ones(1)
            yield x, y

    def __len__(self):
        return 1000000000


class DataLoader(torch.utils.data.DataLoader):
    def __init__(self, batch_size):
        super().__init__(Dataset(), batch_size=batch_size, shuffle=False)

    def __iter__(self):
        for batch in self.dataset:
            yield batch


parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int)
parser.add_argument('--steps', type=int, default=10)
parser.add_argument('--seed', type=int, help="If not None, will set the seed for random, numpy, torch, torch.cuda and, if TPUs are available, torch_xla's cuda state.")
args = parser.parse_args()

accelerator = Accelerator(
    split_batches=True,
    dispatch_batches=True)

if args.seed is not None:
    set_seed(args.seed, device_specific=True)

model = Model().to(accelerator.device)
# DummyOptim is a placeholder; the actual optimizer comes from the DeepSpeed config.
optimizer = DummyOptim(model.parameters(), lr=1e-1, weight_decay=0)
criterion = nn.BCEWithLogitsLoss()
dataloader = DataLoader(batch_size=args.batch_size)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Print the per-rank inputs, weights, outputs, loss, and gradients so the
# 1 GPU and 2 GPU runs can be compared step by step.
for step, batch in enumerate(dataloader):
    batch, target = batch
    batch = batch.to(dtype=model.w.weight.dtype)
    print(accelerator.local_process_index, batch, target)
    print(accelerator.local_process_index, model.w.weight.data, model.w.bias.data)
    output = model(batch)
    print(accelerator.local_process_index, output)
    loss = criterion(output, target)
    print(accelerator.local_process_index, loss.item())
    accelerator.backward(loss)
    print(accelerator.local_process_index, model.w.weight.grad, model.w.bias.grad)
    print(accelerator.local_process_index, model.w.weight.data, model.w.bias.data)
    optimizer.step()
    optimizer.zero_grad()
    if (step + 1) == args.steps:
        break
Feel free to comment out the print calls; printing the weights after each update is enough to see the divergence.
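To make that comparison less manual, one option (not part of the original script; the file name and keys are arbitrary) is to dump the two parameters to disk each step, inside the training loop above, and diff the 1 and 2 GPU runs offline:

```python
# Hypothetical addition inside the training loop above: persist the parameters
# each step so the runs can be compared offline instead of via prints.
if accelerator.is_main_process:
    torch.save(
        {"step": step,
         "weight": model.w.weight.detach().float().cpu(),
         "bias": model.w.bias.detach().float().cpu()},
        f"weights_step{step:04d}_np{accelerator.num_processes}.pt")
```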
Here's my accelerate config for the 1 GPU case:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deespeed_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
And its deepspeed config:
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-1,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "zero_optimization": {
        "stage": 0,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        }
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 64,
    "gradient_accumulation_steps": 1
}
And the 2 GPU accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deespeed_config_B.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
And its deepspeed config:
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-1,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "zero_optimization": {
        "stage": 0,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        }
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1
}
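Note that the effective global batch size is 64 in both setups (1 × 64 vs 2 × 32, with no gradient accumulation), so the comparison is apples-to-apples. A trivial check of that arithmetic, with the numbers simply mirroring the two configs above:

```python
# Effective global batch = micro batch per GPU * number of GPUs * gradient accumulation steps.
def effective_batch(micro_per_gpu, num_gpus, grad_accum_steps):
    return micro_per_gpu * num_gpus * grad_accum_steps

assert effective_batch(64, 1, 1) == effective_batch(32, 2, 1) == 64  # matches "train_batch_size"
```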
Now run the Python script for both the 1 GPU and 2 GPU cases:
accelerate launch --config_file=<accelerate_config_1gpu> <python_script> --batch_size=64 --steps=500 --seed=0
accelerate launch --config_file=<accelerate_config_2gpus> <python_script> --batch_size=64 --steps=500 --seed=0
You can compare the model weights and observe what I have described.
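If the per-step dumps suggested earlier are used, a short script (hypothetical file names matching that sketch, with np1/np2 marking the 1 GPU and 2 GPU runs) can report the first step at which the two runs disagree:

```python
import glob

import torch

# Compare the hypothetical per-step dumps from the 1-GPU (np1) and 2-GPU (np2) runs.
for f1, f2 in zip(sorted(glob.glob("weights_step*_np1.pt")),
                  sorted(glob.glob("weights_step*_np2.pt"))):
    a, b = torch.load(f1), torch.load(f2)
    if not torch.equal(a["weight"], b["weight"]) or not torch.equal(a["bias"], b["bias"]):
        print(f"First divergence at step {a['step']}: "
              f"weight {a['weight'].item():.10f} vs {b['weight'].item():.10f}")
        break
```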
I ran seeds 0 to 4 inclusive, averaged the loss, and got the following graph:
[Graph: average training loss over steps, 1 GPU vs 2 GPUs, averaged over seeds 0 to 4]
This showcases the performance difference: the orange loss is slightly worse than the blue loss. Note that if I change the optimizer to SGD, the curves are identical, but using Adam results in a divergence.
If you need any other details, let me know.
FYI, I closed the issue on accelerate's side: I had opened two issues, one in accelerate's repo and one in DeepSpeed's repo, because I didn't know at the time where the problem was.
Hello, I observe the same thing, except much worse: training diverges further as I bump up the number of GPUs. I'm training a BLIP-2 model from scratch in bf16 precision, ZeRO stage 2. DDP doesn't really change its behavior much when adding GPUs, but DeepSpeed gets worse and worse. I use Lightning, not accelerate, with all params at default. Any hints on how to debug this are appreciated!
Any update on this issue? I observe the same thing.