
[Bug]: Multi-GPU Dreambooth Training based on Accelerator does not work!

Open Dinxin opened this issue 2 years ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

I used the multi-GPU training function provided by the Accelerate library to reduce the training time of Dreambooth.

This is the content of conf/default_config.yaml:

commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 2,3
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

I want to use two of the four available GPUs (gpu_ids: 2,3) to conduct the training.
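For context, here is a minimal stdlib-only sketch of the device re-indexing involved (this is an assumption about how Accelerate exposes gpu_ids to its workers, not something verified against this setup): the listed physical GPUs are made visible via CUDA_VISIBLE_DEVICES, so GPUs 2 and 3 would appear inside each worker as cuda:0 and cuda:1.

```python
import os

# Sketch (assumption): Accelerate exports gpu_ids as CUDA_VISIBLE_DEVICES,
# so physical GPUs 2 and 3 are re-indexed as cuda:0 and cuda:1 inside
# each worker process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
physical_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
device_map = {f"cuda:{i}": f"physical GPU {p}"
              for i, p in enumerate(physical_ids)}
print(device_map)  # {'cuda:0': 'physical GPU 2', 'cuda:1': 'physical GPU 3'}
```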

Unfortunately, I ran into the following two problems:

  • The program only ran on the first of the specified GPUs (ID 2), instead of running on both specified GPUs at the same time;
    (screenshot omitted)
  • There was no readable output in either the server backend or the web frontend.
    (screenshot omitted)
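To help triage the first problem: in a healthy multi-GPU launch, each worker receives a LOCAL_RANK environment variable and should pin itself to the matching local device. The snippet below only simulates what the two expected workers would report (the LOCAL_RANK values are hard-coded here, not read from a real launch):

```python
import os

# Simulated check: accelerate launch sets LOCAL_RANK per worker; with
# num_processes: 2 we expect ranks 0 and 1, each using its own device.
# The rank values below are simulated for illustration.
for simulated_rank in ("0", "1"):
    os.environ["LOCAL_RANK"] = simulated_rank
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {local_rank} -> cuda:{local_rank}")
```

If only rank 0 ever appears in the logs, the second worker was never spawned or died silently, which matches the symptoms above.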

The startup script, webui-user-cuda1.sh, contains:


accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access

Steps to reproduce the problem

  1. Write the following content into conf/default_config.yaml:

```
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 2,3
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```

  2. Run the following commands:

```
export GRADIO_SERVER_PORT=8081
accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
```

What should have happened?

Training should have run on both specified GPUs (IDs 2 and 3); instead, I did not see a correct multi-GPU training run.

Commit where the problem happens

9e3584f0edd2e64d284b6aaf9580ade5dcceed9d

What platforms do you use to access the UI ?

Linux

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

GRADIO_SERVER_PORT=8081

accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access

List of extensions

dreambooth

Console logs

Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8

#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:

Python revision: 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision: 

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################

Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access

#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:

Python revision: 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision: 

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################

Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Running on local URL:  http://0.0.0.0:8081

To create a public link, set `share=True` in `launch()`.
Running on local URL:  http://0.0.0.0:8082

To create a public link, set `share=True` in `launch()`.
Loading model from checkpoint.
Loading checkpoint...
v1 model loaded.
Creating scheduler...
Converting unet...
Converting vae...
Converting text encoder...
Saving diffusers model...
 Restored system models. 
 Allocated: 2.0GB 
 Reserved: 2.0GB 

 Allocated 2.0/2.0GB 
 Reserved: 2.0/2.0GB 

Checkpoint successfully extracted to /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/working
Concept 0 class dir is /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/classifiers_0
Starting Dreambooth training...
 Allocated 0.0/2.0GB 
 Reserved: 0.0/2.0GB 

Initializing dreambooth training...
Patching transformers to fix kwargs errors.
/root/anaconda3/envs/novelai/lib/python3.8/site-packages/transformers/generation_utils.py:24: FutureWarning: Importing `GenerationMixin` from `src/transformers/generation_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import GenerationMixin` instead.
  warnings.warn(
Replace CrossAttention.forward to use default

Additional information

No response

Dinxin avatar Feb 16 '23 03:02 Dinxin

I have exactly the same problem. Is there a plan to add support for the multi-GPU setting? Batch size seems to be extremely important for quality when fine-tuning.

SavvaI avatar Mar 10 '23 11:03 SavvaI

We are also interested in the development of this feature.

fernando-deka avatar Apr 11 '23 09:04 fernando-deka