
[Bug]: Multi-GPU Dreambooth Training based on Accelerator does not work!

Open Dinxin opened this issue 2 years ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

I used the multi-GPU training function provided by the Accelerate library to reduce the training time of Dreambooth.

This is the content of conf/default_config.yaml:

commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 2,3
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

I want to use two of the four available GPUs (gpu_ids: 2,3) to conduct the training.
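For context, here is a minimal stdlib-only sketch of the device re-indexing involved (this is an assumption about how Accelerate exposes gpu_ids to its workers, not something verified against this setup): the listed physical GPUs are made visible via CUDA_VISIBLE_DEVICES, so GPUs 2 and 3 would appear inside each worker as cuda:0 and cuda:1.

```python
import os

# Sketch (assumption): Accelerate exports gpu_ids as CUDA_VISIBLE_DEVICES,
# so physical GPUs 2 and 3 are re-indexed as cuda:0 and cuda:1 inside
# each worker process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
physical_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
device_map = {f"cuda:{i}": f"physical GPU {p}"
              for i, p in enumerate(physical_ids)}
print(device_map)  # {'cuda:0': 'physical GPU 2', 'cuda:1': 'physical GPU 3'}
```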

Unfortunately, I ran into the following two problems:

  • The program only ran on the first of the specified GPUs (ID 2), instead of running on both specified GPUs at the same time;
    (screenshot omitted)
  • There was no readable output in either the server backend or the web frontend.
    (screenshot omitted)
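To help triage the first problem: in a healthy multi-GPU launch, each worker receives a LOCAL_RANK environment variable and should pin itself to the matching local device. The snippet below only simulates what the two expected workers would report (the LOCAL_RANK values are hard-coded here, not read from a real launch):

```python
import os

# Simulated check: accelerate launch sets LOCAL_RANK per worker; with
# num_processes: 2 we expect ranks 0 and 1, each using its own device.
# The rank values below are simulated for illustration.
for simulated_rank in ("0", "1"):
    os.environ["LOCAL_RANK"] = simulated_rank
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {local_rank} -> cuda:{local_rank}")
```

If only rank 0 ever appears in the logs, the second worker was never spawned or died silently, which matches the symptoms above.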

The startup script, webui-user-cuda1.sh, contains:


accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access

Steps to reproduce the problem

  1. Write the following content into conf/default_config.yaml:

```
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 2,3
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```

  2. Run the following commands:

```
export GRADIO_SERVER_PORT=8081
accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
```

What should have happened?

Training should have run on both specified GPUs (IDs 2 and 3); instead, I did not see a correct multi-GPU training run.

Commit where the problem happens

9e3584f0edd2e64d284b6aaf9580ade5dcceed9d

What platforms do you use to access the UI ?

Linux

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

GRADIO_SERVER_PORT=8081

accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access

List of extensions

dreambooth

Console logs

Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8

#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:

Python revision: 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision: 

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################

Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access

#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:

Python revision: 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision: 

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################

Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Running on local URL:  http://0.0.0.0:8081

To create a public link, set `share=True` in `launch()`.
Running on local URL:  http://0.0.0.0:8082

To create a public link, set `share=True` in `launch()`.
Loading model from checkpoint.
Loading checkpoint...
v1 model loaded.
Creating scheduler...
Converting unet...
Converting vae...
Converting text encoder...
Saving diffusers model...
 Restored system models. 
 Allocated: 2.0GB 
 Reserved: 2.0GB 

 Allocated 2.0/2.0GB 
 Reserved: 2.0/2.0GB 

Checkpoint successfully extracted to /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/working
Concept 0 class dir is /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/classifiers_0
Starting Dreambooth training...
 Allocated 0.0/2.0GB 
 Reserved: 0.0/2.0GB 

Initializing dreambooth training...
Patching transformers to fix kwargs errors.
/root/anaconda3/envs/novelai/lib/python3.8/site-packages/transformers/generation_utils.py:24: FutureWarning: Importing `GenerationMixin` from `src/transformers/generation_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import GenerationMixin` instead.
  warnings.warn(
Replace CrossAttention.forward to use default

Additional information

No response

Dinxin avatar Feb 16 '23 03:02 Dinxin

I have exactly the same problem. Is there a plan to add support for the multi-GPU setting? Batch size seems to be extremely important for quality when fine-tuning.

SavvaI avatar Mar 10 '23 11:03 SavvaI

We are also interested in the development of this feature.

fernando-deka avatar Apr 11 '23 09:04 fernando-deka