Dreambooth example with multi-gpu significantly slower than single GPU
Describe the bug
If the accelerate config is set up for multi-GPU (the default config works), training speed appears to slow dramatically. With a 2x 3090 system I'm seeing training go from ~2 it/s with a single 3090 configured to ~6 s/it with multi-GPU configured.
I'm guessing I'm either misunderstanding what should be happening in a multi-GPU context, or there's more configuration needed?
Reproduction
Generate the default config with accelerate config default and then run the dreambooth script with the following args:
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=/store/sd/diffusers_models/stable-diffusion-2-1 \
  --instance_data_dir=/path/to/data \
  --output_dir=/path/to/out \
  --instance_prompt="instance prompt here" \
  --resolution=768 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=5000 \
  --train_text_encoder \
  --enable_xformers_memory_efficient_attention \
  --sample_batch_size=4 \
  --checkpointing_steps=500 \
  --use_8bit_adam
Note: I'm seeing the same behaviour with any value of --mixed_precision if that's also supplied.
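For reference, the same multi-GPU settings can also be passed explicitly to accelerate launch instead of relying on the saved config. A rough sketch using the standard accelerate launch flags (script arguments kept the same as above):

accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 train_dreambooth.py \
  --pretrained_model_name_or_path=/store/sd/diffusers_models/stable-diffusion-2-1 \
  # ... same remaining script arguments as above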
Logs
With multi-GPU (2x 3090):
[00:40<7:58:13, 5.75s/it, loss=0.175, lr=2e-6]
Single 3090:
[00:11<42:23, 1.96it/s, loss=0.299, lr=2e-6]
System Info
Latest diffusers commit [debc74f]
I can provide any module versions people think are relevant, but I generally have the latest requirements.txt installed.
Can you share the contents of your .cache/huggingface/accelerate/default_config.yaml file? It'll help in understanding whether accelerate is able to find both of your GPUs.
The path to the file should have been listed when you previously generated your config, or you can run accelerate config default and it should point to the already existing yaml file.
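For example, assuming the config lives in the default location, something like this should print it:

cat ~/.cache/huggingface/accelerate/default_config.yaml

(accelerate env also prints the current default config along with version information.)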
Sure, here is the config:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 0,1
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
I can also confirm both GPUs are being used. Watching nvidia-smi, I can see them both fully loading VRAM for training and sitting at ~100% utilization during training.
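For completeness, a generic way to double-check that both cards are visible to PyTorch (nothing specific to this setup):

nvidia-smi -L
python -c "import torch; print(torch.cuda.device_count())"  # should print 2 on a 2x 3090 machine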
cc @patil-suraj @williamberman @pcuenca
Same problem here. Any solutions?
cc @patil-suraj again ;-)
Similar issue: #1734, answered here: https://github.com/huggingface/diffusers/issues/1734#issuecomment-1366017170.
I'll add some docs to the dreambooth README for multi-GPU training.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.