[Dreambooth] Multi-GPU training with accelerate is magnitudes slower than single GPU (non-flax)
Describe the bug
I have access to a machine with several NVIDIA A100s. Initially I used multiple GPUs for Dreambooth training with the expectation that it would speed up training. After training on both multiple GPUs and a single GPU, I have found the opposite to be true: the multi-GPU run is an order of magnitude slower. My knowledge of distributed GPU systems is limited, but my current suspicion is that accelerate is slowing things down.
Multi-GPU (2 A100s) average speed (according to the tqdm):
2.59 s/it
Single GPU (1 A100) average speed (according to tqdm):
2.61 it/s
Note the units: that is seconds per iteration versus iterations per second. The difference is enormous, and it makes leveraging the full strength of multi-GPU hardware impossible.
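To make the unit difference explicit, here is a quick back-of-the-envelope conversion, a rough sketch using the averages above:
multi_gpu_s_per_it = 2.59        # reported multi-GPU speed, seconds per iteration
single_gpu_it_per_s = 2.61       # reported single-GPU speed, iterations per second

single_gpu_s_per_it = 1.0 / single_gpu_it_per_s
slowdown = multi_gpu_s_per_it / single_gpu_s_per_it
print(f"single GPU: {single_gpu_s_per_it:.2f} s/it")              # ~0.38 s/it
print(f"multi-GPU is ~{slowdown:.1f}x slower per iteration")      # ~6.8x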
Reproduction
All images used during Dreambooth training are 512x512. The script used to launch is here:
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR --output_dir=$OUTPUT_DIR \
--class_data_dir=$CLASS_DIR \
--class_prompt="woman" \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="bqp" --seed=111 --resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--mixed_precision="fp16" \
--learning_rate=2e-6 --num_class_images=250 --lr_scheduler="constant" --lr_warmup_steps=0 \
--max_train_steps=750 --train_text_encoder
Here is my accelerate config for multi-GPU use (just two GPUs):
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 5,6
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
Here is my single GPU accelerate config:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: '5'
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
Logs
No response
System Info
- `diffusers` version: 0.11.0.dev0
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.13.0+cu117 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 4.25.1
- Using GPU in script?: 1 or 2 Nvidia A100s (read above)
- Using distributed or parallel set-up in script?: Parallel
We should definitely test Dreambooth on multiple GPUs a bit more. Have we ever looked into this? cc @patil-suraj @pcuenca @williamberman
I have tested the script extensively on multi-GPU and it works totally fine in my experiments. A few notes:
- The per-step time in a multi-GPU setting will be higher because there is more work going on per step: averaging gradients across the workers, syncing parameters and gradients, etc.
- Note that when using multiple GPUs, the effective batch size is num_gpus * train_batch_size, since each GPU processes a batch of size train_batch_size. So the 2.59 s/it you see is essentially for a batch size of 2 (1 per GPU), while 2.61 it/s is for a batch size of 1. It's not slow, it's just processing a bigger batch per step. And because the batch size is multiplied in multi-GPU, you can reduce the number of training steps by the same factor (for example, with two GPUs you can halve the number of steps you were doing on a single GPU); see the sketch below.
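Here is a minimal sketch of that bookkeeping in plain Python (placeholder numbers, not part of the training script):
# Effective batch size and equivalent step count when moving to multi-GPU.
num_gpus = 2
train_batch_size = 1        # per-GPU batch size passed to train_dreambooth.py
single_gpu_steps = 800      # steps you would have used on a single GPU

effective_batch_size = num_gpus * train_batch_size
# Keep the total number of images seen constant by scaling the steps down.
multi_gpu_steps = single_gpu_steps // num_gpus

print(f"effective batch size: {effective_batch_size}")     # 2
print(f"equivalent max_train_steps: {multi_gpu_steps}")     # 400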
Hope this explains it a bit :)
That does help. Just to confirm before I test and verify it myself: if I was doing 800 steps on one GPU, I should get equivalent inference performance from a model trained for 400 steps on 2 GPUs, assuming the same seed, dataset, and base model?
Yes, with multi-GPU the steps can be reduced, and ideally the result should be equivalent. In some cases we might need to do more or fewer steps.
Ok, here's the outcome of my experiment. I used Stable Diffusion 1.5 as a base with a subject dataset of my face. The single-GPU run was done for 900 steps and the multi-GPU (2) run was done for 450 steps, both with the seed 1111. Train batch size is still 1.
- Single GPU speed: 2.72 it/s, 6:00 time to complete
- Multi GPU (2) speed: 2.98 s/it, 22:28 time to complete
From my experiments with a couple of prompts, it's hard to say whether one model is definitively "better" than the other; sometimes I lean more towards the single-GPU 900-step model. But the speeds are still nowhere close to each other, especially the total time to complete, despite the multi-GPU run using half the steps.
Thanks @jkcarney, let me take a look again. What is your accelerate version, and are you using xformers? (It's enabled automatically if installed.)
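If you're not sure, a quick check like this should tell you (a sketch assuming the is_xformers_available helper exposed by your installed diffusers version):
# Check whether xformers is installed in the training environment.
from diffusers.utils.import_utils import is_xformers_available

if is_xformers_available():
    print("xformers is available, so memory-efficient attention gets enabled automatically.")
else:
    print("xformers is not installed; attention falls back to the default PyTorch implementation.")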
accelerate==0.15.0
Not using xformers at the moment.
I tried this again using a setup similar to yours, but couldn't reproduce it. Here's what I got on 2 A100s:
- One GPU, 900 steps: 6:41
- Two GPUs, 450 steps: 3:30
Here's the command that I used.
accelerate launch --mixed_precision="fp16" --multi_gpu --gpu_ids="0,1" \
../diffusers/examples/dreambooth/train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of skramer" \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=1e-6 --lr_scheduler="constant" --lr_warmup_steps=0 \
--max_train_steps=450 \
--seed=323234 \
--class_prompt="photo of a person" \
--class_data_dir="./datasets/Men" \
--with_prior_preservation --num_class_images=200 \
--train_text_encoder
Maybe there is an issue with your setup, but I'm not sure. In everything I have tried, multi-GPU doesn't slow down my training.
Interesting. I'm not quite sure how exactly the machine is configured, very possible that it's some issue there. I appreciate the assistance though. I'll close this issue since it's more of a me problem 😁
@jkcarney did you have any findings? I'm asking because I have exactly the same observation. I was using RTX A4000 without xformers:
- Single GPU speed is 2.62it/s, which is equivalent to 0.38s/it.
- Two GPU speed is 1.20s/it. If we consider the batch size is 2x, it's equivalent to 0.6s/it.
- Three GPU speed is 2.31s/it. If we consider the batch size is 3x, it's equivalent to 0.77s/it.
Generally we see a decrease in overall throughput with more GPUs, even after normalizing for the effective batch size.
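For reference, here is a small sketch that normalizes those reported speeds to per-sample time (numbers taken directly from the bullets above):
# Convert the reported speeds to seconds per training sample so that runs
# with different GPU counts (and therefore different effective batch sizes)
# can be compared directly.
runs = {
    1: (2.62, "it/s"),   # single GPU
    2: (1.20, "s/it"),   # two GPUs
    3: (2.31, "s/it"),   # three GPUs
}

for num_gpus, (value, unit) in runs.items():
    seconds_per_step = 1.0 / value if unit == "it/s" else value
    seconds_per_sample = seconds_per_step / num_gpus  # each step processes num_gpus samples
    print(f"{num_gpus} GPU(s): {seconds_per_sample:.2f} s per sample")
The per-sample cost climbs from roughly 0.38 s to 0.60 s to 0.77 s, the opposite of what healthy data-parallel scaling should look like.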
Wanted to bridge into this discussion from another bug thread on this (https://github.com/huggingface/diffusers/issues/1851)
@williamberman @patil-suraj I wanted to comment that I'm not sure documentation is the only issue, unless the fundamental problem is just in the accelerate config (my config is posted in the bug thread linked above). @patil-suraj, could you post the config for your A100 test?
The discussion here seems to trend toward: yes, multi-GPU has a slower per-iteration speed, but the total number of steps is divided by the number of GPUs, so it balances out to be faster overall.
But if you look at the results from @jkcarney, @grapeot, and myself, the story is different:
- @jkcarney: 2.72 it/s --> 2.98 s/it -- that's not just a halving of speed with 2 GPUs, that's roughly an 8x per-iteration slowdown
- @grapeot: 2.62 it/s --> 1.20 s/it -- roughly a 3x per-iteration slowdown
- mine: 1.96 it/s --> 5.75 s/it -- roughly an 11x per-iteration slowdown
I'm wondering if there's some hardware factor at play here. I'm running 3090s. Could NVLink (or the lack of it) have an impact on multi-GPU runs? Does multi-GPU training mean the GPUs need to share state (gradients), so that without NVLink you're hitting PCIe bottlenecks?
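One way to probe that hypothesis from Python is to ask PyTorch whether the GPUs can reach each other over peer-to-peer; this is only a diagnostic sketch, and nvidia-smi topo -m gives a more complete picture of the interconnect:
import torch

# Report whether each pair of visible GPUs supports peer-to-peer access
# (NVLink or PCIe P2P). Without it, the gradient all-reduce has to bounce
# through host memory, which can bottleneck data-parallel training.
num_devices = torch.cuda.device_count()
for i in range(num_devices):
    for j in range(num_devices):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")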
I've tried again with a fresh environment and am still running into the same issue. Assuming everyone here is running the latest diffusers script, it seems like it's either the accelerate config or the hardware.
@subpanic Yes, very good points! Let me do some digging before addressing them in more depth.