
[BUG] Multi-GPU performance worse than single GPU when using optimizers with moving averages (e.g., Adam)


I will refer to the issue I opened on the accelerate GitHub, because I am unsure whether this is an accelerate, DeepSpeed, or torch issue. All of the details are there.

GabPrato · Oct 20 '23

TL;DR: When training a model with DeepSpeed, performance in a multi-GPU setup is slightly worse than with a single GPU, depending on the optimizer. This is true whether you use torch's or DeepSpeed's optimizers.

Here's what I observed when running accelerate with DeepSpeed, 1 GPU vs. 2 GPUs:

  • First training step: everything is the same (model input, model weights, model output, loss, gradient, and model weights after the weight update).
  • Second training step: model input, model weights, model output, loss, and gradient are still the same, but the model weights after the weight update now differ between the 1-GPU and 2-GPU setups.
  • For the following steps, since the weights have diverged, everything else diverges as well, and the 2-GPU setup ends up performing worse than the 1-GPU setup (a reference sketch of the AdamW update follows this list).
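
For reference, here is a minimal sketch of the standard AdamW update (a plain PyTorch sketch of the textbook algorithm, not DeepSpeed's fused implementation). The step depends on the running moments m and v, so once the weights or the optimizer state differ between the two setups even slightly, every subsequent update differs as well; plain SGD carries no such state.

import torch

def adamw_step(p, grad, m, v, t, lr=1e-1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    # m and v are the moving averages that persist across steps;
    # this state is what plain SGD does not have.
    p = p - lr * weight_decay * p                # decoupled weight decay (0 in the configs below)
    m = beta1 * m + (1 - beta1) * grad           # first-moment moving average
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment moving average
    m_hat = m / (1 - beta1 ** t)                 # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)
    return p, m, v

# Example: even with identical gradients, each step's update differs from the
# previous one because m and v keep evolving.
p, m, v = torch.tensor([0.5]), torch.zeros(1), torch.zeros(1)
for t in range(1, 3):
    p, m, v = adamw_step(p, torch.tensor([0.1]), m, v, t)
    print(t, p)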

Here's how to reproduce this. I am using Hugging Face accelerate, but I would expect plain DeepSpeed to show the same problem.

OS and devices: A100 GPUs, Ubuntu 22.04, CUDA 11.8

Packages:

accelerate=0.25.0
deepspeed=0.12.6
python=3.10.13
pytorch=2.1.2

Code to reproduce:

import argparse
import torch
import torch.nn as nn
from accelerate import Accelerator
from accelerate.utils import set_seed, DummyOptim


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Linear(1, 1)

    def forward(self, x):
        return self.w(x)


class Dataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        for _ in range(1000000000):
            x = torch.rand(1) * 2
            y = torch.zeros(1) if x < 1 else torch.ones(1)
            yield x, y

    def __len__(self):
        return 1000000000


class DataLoader(torch.utils.data.DataLoader):
    def __init__(self, batch_size):
        super().__init__(Dataset(), batch_size=batch_size, shuffle=False)

    def __iter__(self):
        for batch in self.dataset:
            yield batch


parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int)
parser.add_argument('--steps', type=int, default=10)
parser.add_argument('--seed', type=int, help='If not None, will set the seed for random, numpy, torch, torch.cuda and if TPUs are available torch_xla’s cuda state.')
args = parser.parse_args()

accelerator = Accelerator(
    split_batches=True,
    dispatch_batches=True)

if args.seed is not None:
    set_seed(args.seed, device_specific=True)

model = Model().to(accelerator.device)
optimizer = DummyOptim(model.parameters(), lr=1e-1, weight_decay=0)
criterion = nn.BCEWithLogitsLoss()
dataloader = DataLoader(batch_size=args.batch_size)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, batch in enumerate(dataloader):
    batch, target = batch
    batch = batch.to(dtype=model.w.weight.dtype)
    print(accelerator.local_process_index, batch, target)

    print(accelerator.local_process_index, model.w.weight.data, model.w.bias.data)

    output = model(batch)
    print(accelerator.local_process_index, output)
    loss = criterion(output, target)
    print(accelerator.local_process_index, loss.item())
    accelerator.backward(loss)
    print(accelerator.local_process_index, model.w.weight.grad, model.w.bias.grad)
    print(accelerator.local_process_index, model.w.weight.data, model.w.bias.data)
    optimizer.step()
    optimizer.zero_grad()

    if (step + 1) == args.steps:
        break

Feel free to comment out the print calls; printing the weights after each update is enough to see the divergence.
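
As an alternative to eyeballing the prints, here is a sketch (not part of the original repro; the dumps/ directory and file naming are made up) that saves the weights to disk after each update so the 1-GPU and 2-GPU runs can be compared offline:

# Sketch: place right after optimizer.step() in the loop above.
# Only the main process writes; files are keyed by world size and step.
import os
import torch

if accelerator.is_main_process:
    os.makedirs('dumps', exist_ok=True)
    torch.save(
        {'weight': model.w.weight.detach().cpu().clone(),
         'bias': model.w.bias.detach().cpu().clone()},
        f'dumps/world{accelerator.num_processes}_step{step}.pt')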

Here's my accelerate config for the 1 GPU case:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deespeed_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And its DeepSpeed config:

{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-1,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "zero_optimization": {
        "stage": 0,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        }
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 64,
    "gradient_accumulation_steps": 1
}

And the 2-GPU accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deespeed_config_B.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And its DeepSpeed config:

{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-1,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "zero_optimization": {
        "stage": 0,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        }
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1
}
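
Note that the two configs keep the effective batch identical: DeepSpeed expects train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs. A quick sanity check with the values from the two configs above:

# Effective-batch arithmetic for the two DeepSpeed configs above.
def check(train_batch_size, micro_per_gpu, grad_accum, num_gpus):
    assert train_batch_size == micro_per_gpu * grad_accum * num_gpus

check(64, 64, 1, 1)  # 1-GPU config
check(64, 32, 1, 2)  # 2-GPU config: same effective batch of 64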

Now run the Python script for both the 1-GPU and the 2-GPU case:

accelerate launch --config_file=<accelerate_config_1gpu> <python_script> --batch_size=64 --steps=500 --seed=0
accelerate launch --config_file=<accelerate_config_2gpus> <python_script> --batch_size=64 --steps=500 --seed=0

You can then compare the model weights and observe what I have described.
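
If you used the dump-to-disk variant sketched earlier, the two runs can then be compared bit-for-bit (again a sketch, assuming the hypothetical dumps/ layout from that snippet):

# Sketch: report the first step at which the 1-GPU and 2-GPU weights differ.
import torch

for step in range(500):
    a = torch.load(f'dumps/world1_step{step}.pt')
    b = torch.load(f'dumps/world2_step{step}.pt')
    if not (torch.equal(a['weight'], b['weight']) and torch.equal(a['bias'], b['bias'])):
        print(f'first divergence at step {step}')
        break
else:
    print('no divergence within 500 steps')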

I ran seeds 0 to 4 inclusive and averaged the loss, which gives the following graph:

[plot: averaged training loss for the 1-GPU and 2-GPU runs]

This showcases the performance difference: the orange loss is doing slightly worse than the blue loss. Note that if I change the optimizer to SGD, the lines are identical, but using Adam results in a divergence.

If you need any other details, let me know.

GabPrato · Dec 30 '23

FYI, I closed the issue that was opened on accelerate's side. I had opened two issues, one in accelerate's repo and one in DeepSpeed's repo, because I didn't know at the time where the problem was.

GabPrato · Dec 30 '23

Hello, I observe the same thing, except much worse: training diverges as I increase the number of GPUs. I'm training a BLIP-2 model from scratch in bf16 precision with ZeRO stage 2. You can see that DDP doesn't really change its behavior much when adding GPUs, but DeepSpeed gets worse and worse. I use Lightning, not accelerate, with all params at their defaults. Any hints on how to debug this are appreciated!

[plot: training loss for DDP vs. DeepSpeed as the number of GPUs increases]

olegsinavski · Jan 29 '24

Any update on this issue? I observe the same thing.

Martin7-1 · May 07 '24