[BUG] Multi-GPU performance worse than single GPU when using optimizers with moving averages (e.g.: Adam)
I will refer to the issue I opened on the accelerate GitHub, since I am unsure whether this is an accelerate, DeepSpeed, or torch issue; all of the details are there.
TL;DR: The performance when training a model with DeepSpeed is slightly worse in a multi-GPU setup compared to a single GPU, depending on the optimizer. This is true whether you use torch's or DeepSpeed's optimizers.
Here's what I observed when running accelerate with DeepSpeed, 1 vs 2 GPUs:
- On the first training step, everything is identical: model input, model weights, model output, loss, gradients, and model weights after the weight update.
- On the second training step, the model input, model weights, model output, loss, and gradients are still identical, but the model weights after the weight update differ between the 1 and 2 GPU setups.
- For the following steps, since the weights have diverged, everything else diverges as well, and the performance of the 2 GPU setup ends up worse than the 1 GPU setup (see the AdamW update sketch just after this list).
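For context on why the optimizer choice matters here: AdamW, as configured in the DeepSpeed configs further down, keeps per-parameter moving averages, so once a single update differs, that difference is also stored in the optimizer state and carried into every later step. Below is a minimal reference sketch of the textbook AdamW math (not DeepSpeed's actual implementation), with the hyperparameters from the configs as defaults:

```python
import torch

def adamw_step(param, grad, m, v, step, lr=1e-1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """Textbook AdamW update for one parameter tensor (reference only)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment moving average
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment moving average
    m_hat = m / (1 - beta1 ** step)             # bias correction (step starts at 1)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay (zero in the configs below) plus the adaptive step.
    param = param - lr * (m_hat / (v_hat.sqrt() + eps) + weight_decay * param)
    return param, m, v
```

Plain SGD without momentum carries no such state, which is consistent with the observation further down that switching the optimizer to SGD makes the 1 and 2 GPU runs identical.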
Here's how one can reproduce this. I am using Hugging Face accelerate, but using DeepSpeed directly should reproduce the same problem.
OS and devices: A100 GPUs, Ubuntu 22.04, CUDA 11.8
Packages:
- accelerate=0.25.0
- deepspeed=0.12.6
- python=3.10.13
- pytorch=2.1.2
Code to reproduce:
import argparse

import torch
import torch.nn as nn
from accelerate import Accelerator
from accelerate.utils import set_seed, DummyOptim


class Model(nn.Module):
    """Single linear unit: one input feature, one output logit."""

    def __init__(self):
        super().__init__()
        self.w = nn.Linear(1, 1)

    def forward(self, x):
        return self.w(x)


class Dataset(torch.utils.data.IterableDataset):
    """Toy stream: x ~ U[0, 2), label is 1 iff x >= 1."""

    def __iter__(self):
        for _ in range(1000000000):
            x = torch.rand(1) * 2
            y = torch.zeros(1) if x < 1 else torch.ones(1)
            yield x, y

    def __len__(self):
        return 1000000000


class DataLoader(torch.utils.data.DataLoader):
    def __init__(self, batch_size):
        super().__init__(Dataset(), batch_size=batch_size, shuffle=False)

    def __iter__(self):
        for batch in self.dataset:
            yield batch


parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int)
parser.add_argument('--steps', type=int, default=10)
parser.add_argument('--seed', type=int, help="If not None, will set the seed for random, numpy, torch, torch.cuda and, if TPUs are available, torch_xla's cuda state.")
args = parser.parse_args()

accelerator = Accelerator(
    split_batches=True,
    dispatch_batches=True)

if args.seed is not None:
    set_seed(args.seed, device_specific=True)

model = Model().to(accelerator.device)
# DummyOptim is a placeholder; the actual optimizer comes from the DeepSpeed config.
optimizer = DummyOptim(model.parameters(), lr=1e-1, weight_decay=0)
criterion = nn.BCEWithLogitsLoss()
dataloader = DataLoader(batch_size=args.batch_size)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Print the per-rank inputs, weights, outputs, loss, and gradients so the
# 1 GPU and 2 GPU runs can be compared step by step.
for step, batch in enumerate(dataloader):
    batch, target = batch
    batch = batch.to(dtype=model.w.weight.dtype)
    print(accelerator.local_process_index, batch, target)
    print(accelerator.local_process_index, model.w.weight.data, model.w.bias.data)
    output = model(batch)
    print(accelerator.local_process_index, output)
    loss = criterion(output, target)
    print(accelerator.local_process_index, loss.item())
    accelerator.backward(loss)
    print(accelerator.local_process_index, model.w.weight.grad, model.w.bias.grad)
    print(accelerator.local_process_index, model.w.weight.data, model.w.bias.data)
    optimizer.step()
    optimizer.zero_grad()
    if (step + 1) == args.steps:
        break
Feel free to comment out the print calls; printing the weights after each update is enough to see the divergence.
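To make that comparison less manual, one option (not part of the original script; the file name and keys are arbitrary) is to dump the two parameters to disk each step, inside the training loop above, and diff the 1 and 2 GPU runs offline:

```python
# Hypothetical addition inside the training loop above: persist the parameters
# each step so the runs can be compared offline instead of via prints.
if accelerator.is_main_process:
    torch.save(
        {"step": step,
         "weight": model.w.weight.detach().float().cpu(),
         "bias": model.w.bias.detach().float().cpu()},
        f"weights_step{step:04d}_np{accelerator.num_processes}.pt")
```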
Here's my accelerate config for the 1 GPU case:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deespeed_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
And its deepspeed config:
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-1,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "zero_optimization": {
        "stage": 0,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        }
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 64,
    "gradient_accumulation_steps": 1
}
And the 2 GPU accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deespeed_config_B.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
And its deepspeed config:
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-1,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "zero_optimization": {
        "stage": 0,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        }
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1
}
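Note that the effective global batch size is 64 in both setups (1 × 64 vs 2 × 32, with no gradient accumulation), so the comparison is apples-to-apples. A trivial check of that arithmetic, with the numbers simply mirroring the two configs above:

```python
# Effective global batch = micro batch per GPU * number of GPUs * gradient accumulation steps.
def effective_batch(micro_per_gpu, num_gpus, grad_accum_steps):
    return micro_per_gpu * num_gpus * grad_accum_steps

assert effective_batch(64, 1, 1) == effective_batch(32, 2, 1) == 64  # matches "train_batch_size"
```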
Now run the Python script for both the 1 GPU and 2 GPU cases:
accelerate launch --config_file=<accelerate_config_1gpu> <python_script> --batch_size=64 --steps=500 --seed=0
accelerate launch --config_file=<accelerate_config_2gpus> <python_script> --batch_size=64 --steps=500 --seed=0
You can compare the model weights and observe what I have described.
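If the per-step dumps suggested earlier are used, a short script (hypothetical file names matching that sketch, with np1/np2 marking the 1 GPU and 2 GPU runs) can report the first step at which the two runs disagree:

```python
import glob

import torch

# Compare the hypothetical per-step dumps from the 1-GPU (np1) and 2-GPU (np2) runs.
for f1, f2 in zip(sorted(glob.glob("weights_step*_np1.pt")),
                  sorted(glob.glob("weights_step*_np2.pt"))):
    a, b = torch.load(f1), torch.load(f2)
    if not torch.equal(a["weight"], b["weight"]) or not torch.equal(a["bias"], b["bias"]):
        print(f"First divergence at step {a['step']}: "
              f"weight {a['weight'].item():.10f} vs {b['weight'].item():.10f}")
        break
```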
I ran seeds 0 to 4 inclusive, averaged the loss, and got the following graph:
[Graph: average training loss over steps, 1 GPU vs 2 GPUs, averaged over seeds 0 to 4]
This showcases the performance difference: the orange loss is slightly worse than the blue loss. Note that if I change the optimizer to SGD, the curves are identical, but using Adam results in a divergence.
If you need any other details, let me know.
FYI, I closed the issue on accelerate's side: I had opened two issues, one in accelerate's repo and one in DeepSpeed's repo, because I didn't know at the time where the problem was.
Hello, I observe the same thing, except much worse: training diverges further as I bump up the number of GPUs. I'm training a BLIP-2 model from scratch in bf16 precision, ZeRO stage 2. DDP doesn't really change its behavior much when adding GPUs, but DeepSpeed gets worse and worse. I use Lightning, not accelerate, with all params at default. Any hints on how to debug this are appreciated!
Any update on this issue? I observe the same thing.