stable-diffusion-webui
[Bug]: Multi-GPU Dreambooth Training based on Accelerator does not work!
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
I used the multi-GPU training function provided by the accelerate library to reduce the training time of Dreambooth.
This is the content of `default_config.yaml`:
```yaml
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 2,3
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
I want to use two of the four available GPUs (gpu_ids: 2,3) for the training.
Unfortunately, I ran into the following two problems:
- The program always ran only on the first specified GPU (number 2), instead of on both specified GPUs at the same time;

- There is no readable output in either the server backend or the web frontend.
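With this config, `accelerate launch` is expected to spawn `num_processes` workers and map each worker's local rank to one of the GPUs listed under `gpu_ids`. A minimal sketch of that expected mapping (illustrative only, not accelerate's actual implementation):

```python
# Illustrative sketch of the expected rank -> device mapping under
# `gpu_ids: 2,3` and `num_processes: 2`. This is NOT accelerate's code,
# just what the config should produce for each spawned worker.
def pick_device(gpu_ids: str, local_rank: int) -> int:
    """Map a worker's local rank to a physical GPU id."""
    ids = [int(x) for x in gpu_ids.split(",")]
    return ids[local_rank]

print(pick_device("2,3", 0))  # rank 0 -> GPU 2
print(pick_device("2,3", 1))  # rank 1 -> GPU 3
```

The bug is that only the rank-0 mapping (GPU 2) ever appears to be used.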

The startup script, named webui-user-cuda1.sh, is:
```bash
accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
```
Steps to reproduce the problem
1. Write the following content into `conf/default_config.yaml`:
```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 2,3
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
2. Run the following commands:
```bash
export GRADIO_SERVER_PORT=8081
accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
```
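As an illustrative sanity check (not part of webui): the distributed launcher that `accelerate launch` wraps sets `LOCAL_RANK` and `WORLD_SIZE` in each worker's environment, so a worker can report whether two processes really started:

```python
import os

# Illustrative helper: read the rank/world-size variables that the
# distributed launcher exports into each worker's environment.
# If WORLD_SIZE reads 2, accelerate really did start two workers.
def describe_worker(env) -> str:
    rank = int(env.get("LOCAL_RANK", 0))
    world = int(env.get("WORLD_SIZE", 1))
    return f"process {rank} of {world}"

print(describe_worker(os.environ))
```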
What should have happened?
Training should have run on both specified GPUs (numbers 2 and 3) in parallel. Instead, I did not see a correct multi-GPU training result.
Commit where the problem happens
9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
What platforms do you use to access the UI ?
Linux
What browsers do you use to access the UI ?
Google Chrome
Command Line Arguments
```bash
GRADIO_SERVER_PORT=8081
accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
```
List of extensions
dreambooth
Console logs
Python 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8
Python 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8
#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:
Python revision: 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision:
Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################
Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:
Python revision: 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision:
Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################
Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Running on local URL: http://0.0.0.0:8081
To create a public link, set `share=True` in `launch()`.
Running on local URL: http://0.0.0.0:8082
To create a public link, set `share=True` in `launch()`.
Loading model from checkpoint.
Loading checkpoint...
v1 model loaded.
Creating scheduler...
Converting unet...
Converting vae...
Converting text encoder...
Saving diffusers model...
Restored system models.
Allocated: 2.0GB
Reserved: 2.0GB
Allocated 2.0/2.0GB
Reserved: 2.0/2.0GB
Checkpoint successfully extracted to /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/working
Concept 0 class dir is /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/classifiers_0
Starting Dreambooth training...
Allocated 0.0/2.0GB
Reserved: 0.0/2.0GB
Initializing dreambooth training...
Patching transformers to fix kwargs errors.
/root/anaconda3/envs/novelai/lib/python3.8/site-packages/transformers/generation_utils.py:24: FutureWarning: Importing `GenerationMixin` from `src/transformers/generation_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import GenerationMixin` instead.
warnings.warn(
Replace CrossAttention.forward to use default
Additional information
No response
I have exactly the same problem. Is there a plan to add support for the multi-GPU setting? Batch size seems to be extremely important for quality when fine-tuning.
We are also interested in the development of this feature.