diffusers
Accelerate error when training with train_dreambooth_lora_sdxl_advanced.py
Describe the bug
I hit this error, with no further information, when running 'train_dreambooth_lora_sdxl_advanced.py':
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sdxl_advanced.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=training_images', '--instance_prompt=photo of ohwx man', '--class_prompt=photo of man', '--class_data_dir=man_dataset', '--output_dir=result', '--mixed_precision=fp16', '--resolution=1024', '--num_train_epochs=10', '--with_prior_preservation', '--prior_loss_weight=1.0', '--train_batch_size=1', '--repeats=20', '--gradient_accumulation_steps=1', '--train_text_encoder', '--gradient_checkpointing', '--learning_rate=1e-4', '--text_encoder_lr=5e-5', '--optimizer=adamW', '--num_class_images=3000', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--rank=128', '--seed=0']' died with <Signals.SIGKILL: 9>.
Reproduction
1. Clone https://github.com/huggingface/diffusers.git
2. cd diffusers, then pip install .
3. cd examples/advanced_diffusion_training
4. pip install -r requirements.txt
5. accelerate config default
6. Run the training script using accelerate:
accelerate launch train_dreambooth_lora_sdxl_advanced.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--instance_data_dir="training_images" \
--instance_prompt="photo of ohwx man" \
--class_prompt="photo of man" \
--class_data_dir="man_dataset" \
--output_dir="result" \
--mixed_precision="fp16" \
--resolution=1024 \
--num_train_epochs=10 \
--with_prior_preservation --prior_loss_weight=1.0 \
--train_batch_size=1 \
--repeats=20 \
--gradient_accumulation_steps=1 \
--train_text_encoder \
--gradient_checkpointing \
--learning_rate=1e-4 \
--text_encoder_lr=5e-5 \
--optimizer="adamW" \
--num_class_images=3000 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--rank=128 \
--seed="0"
Logs
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sdxl_advanced.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=training_images', '--instance_prompt=photo of ohwx man', '--class_prompt=photo of man', '--class_data_dir=man_dataset', '--output_dir=result', '--mixed_precision=fp16', '--resolution=1024', '--num_train_epochs=10', '--with_prior_preservation', '--prior_loss_weight=1.0', '--train_batch_size=1', '--repeats=20', '--gradient_accumulation_steps=1', '--train_text_encoder', '--gradient_checkpointing', '--learning_rate=1e-4', '--text_encoder_lr=5e-5', '--optimizer=adamW', '--num_class_images=3000', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--rank=128', '--seed=0']' die
System Info
- accelerate: 0.29.3
- OS: Ubuntu 22.04
- Python version: 3.10.12
- torch version: 2.1.0+cu118
- numpy version: 1.24.1
- GPU: A6000
accelerate configuration is default
Who can help?
@sayakpaul
Cc: @linoytsaban
Could the process be running out of system RAM (as opposed to GPU memory)? Can you track how RAM usage increases while the command runs?
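One way to do that tracking is sketched below. It assumes a Linux host where /proc/meminfo is readable; the function names `parse_meminfo` and `log_available_ram` are made up for this example and are not part of diffusers or accelerate.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (bytes where a kB unit is given)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        amount = int(fields[0])
        # Most entries are reported in kB; a few (e.g. HugePages_Total) are plain counts.
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[key.strip()] = amount
    return values


def log_available_ram(samples, interval_s=5.0, path="/proc/meminfo"):
    """Print MemAvailable in GiB `samples` times, `interval_s` seconds apart (Linux only)."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        gib = info["MemAvailable"] / 2**30
        print(f"{time.strftime('%H:%M:%S')} MemAvailable: {gib:.2f} GiB")
        time.sleep(interval_s)
```

Calling something like `log_available_ram(samples=120)` in a second terminal while the training command starts up would show whether available RAM drops toward zero just before the process is killed.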
I am already using a machine with 50 GB of RAM.
This is my config.
There is a SIGKILL. Could you examine the kernel's log:
dmesg --ctime | grep --ignore-case --before-context 1 "killed"
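Some background on why the traceback carries so little information: it comes from the accelerate launcher, which only sees that its child process ended because of signal 9. A process cannot catch SIGKILL, so the training script gets no chance to print its own traceback. The Linux out-of-memory killer is one common sender of SIGKILL, though not the only possible one, which is why the kernel log is worth checking. The sketch below shows how a return code as reported by Python's `subprocess` module maps to a signal name; `describe_returncode` is a hypothetical helper written for this illustration.

```python
import signal


def describe_returncode(returncode):
    """Explain a child return code as Python's subprocess module reports it on POSIX."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # A negative value -N means the child was terminated by signal N.
        name = signal.Signals(-returncode).name
        return f"killed by {name}"
    return f"exited with status {returncode}"
```

For the failure in this thread, `process.returncode` would be -9, which this helper reports as a kill by SIGKILL.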
In addition to what @standardAI asked: did you try other configs, and did any of them work? Was there anything else in the logs before the error?
I tested it too. I get this error. @linoytsaban @standardAI
- Platform: Ubuntu 22.04.3 LTS - Linux-5.4.0-169-generic-x86_64-with-glibc2.35
- Running on a notebook?: No
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.23.0
- Transformers version: 4.41.0
- Accelerate version: 0.30.1
- PEFT version: 0.11.2.dev0
- Bitsandbytes version: 0.43.1
- Safetensors version: 0.4.3
- xFormers version: 0.0.24
- Accelerator: NVIDIA GeForce RTX 3090, 24576 MiB VRAM
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Code:
accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path="SG161222/RealVisXL_V4.0" \
--instance_data_dir="train/image" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--output_dir="lora-trained-xl" \
--mixed_precision="fp16" \
--instance_prompt="a photo of try_on a model wearing" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="a photo of try_on a model wearing" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
I added these parameters and tested again. Error persists.
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--mixed_precision="fp16" \
Code:
dmesg --ctime | grep --ignore-case --before-context 1 "killed"
Output: dmesg: read kernel buffer failed: Operation not permitted
Isn't running the dmesg command with sudo possible in that environment?
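"Operation not permitted" from dmesg usually means the kernel log is restricted to privileged users (for example through the `kernel.dmesg_restrict` setting, or inside a container without the needed capability). If the kernel log can be obtained some other way, for instance via `sudo dmesg --ctime` or `journalctl -k` on the host, a small filter like the one below could pick out OOM-killer entries. This is a best-effort sketch: the exact wording of those entries differs between kernel versions, and `find_oom_kills` is a name invented for this example.

```python
import re

# Matches the "Killed process <pid> (<name>)" fragment that the Linux OOM killer logs.
# The surrounding text varies between kernel versions, so this is a best-effort pattern.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return (pid, process_name) pairs for OOM-killer lines found in kernel log text."""
    return [(int(m.group(1)), m.group(2)) for m in KILLED_RE.finditer(log_text)]


# Illustrative input only; this line is NOT taken from the thread.
sample = "[Mon Jan  1 00:00:00 2024] Out of memory: Killed process 1234 (python) total-vm:1kB"
print(find_oom_kills(sample))
```

An empty result does not rule out an out-of-memory kill; the kernel log buffer may have rotated, or the kill may have been issued by a container runtime or job scheduler instead.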
Hey @nayan-dhabarde @kadirnar, I tried your params with my data and couldn't reproduce the error:
!accelerate launch train_dreambooth_lora_sdxl_advanced.py \
--pretrained_model_name_or_path="SG161222/RealVisXL_V4.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--dataset_name="linoyts/Tuxemon" \
--output_dir="test" \
--mixed_precision="fp16" \
--instance_prompt="a cartoon of TOK tuxemon monster" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="a cartoon of TOK pink turtle tuxemon monster" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
and
!accelerate launch train_dreambooth_lora_sdxl_advanced.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--dataset_name="linoyts/Tuxemon" \
--output_dir="test" \
--mixed_precision="fp16" \
--instance_prompt="a cartoon of TOK tuxemon monster" \
--resolution=1024 \
--num_train_epochs=10 \
--train_batch_size=1 \
--repeats=20 \
--gradient_accumulation_steps=1 \
--train_text_encoder \
--gradient_checkpointing \
--learning_rate=1e-4 \
--text_encoder_lr=5e-5 \
--optimizer="adamW" \
--num_class_images=3000 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--rank=128 \
--seed="0" \
--push_to_hub
Does it fail only with these configs and work with others?
Hi @linoytsaban, I am training an SD3 LoRA on 11,000 images and I get this error. It works when I use a smaller dataset, or when I reduce the image-size parameter.
Code:
accelerate launch train_dreambooth_lora_sd3.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers" \
--instance_data_dir="image" \
--output_dir="fb-outerwear" \
--mixed_precision="fp16" \
--instance_prompt="This photo is a outerwear" \
--resolution=256 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=4e-6 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=1000 \
--validation_prompt="This photo is a outerwear" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
Error:
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.30s/it]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sd3.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-3-medium-diffusers', '--instance_data_dir=image', '--output_dir=fb-outerwear', '--mixed_precision=fp16', '--instance_prompt=This photo is the Fenerbahce outerwear', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=4e-6', '--report_to=wandb', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=1000', '--validation_prompt=This photo is the outerwear', '--validation_epochs=25', '--seed=0', '--push_to_hub']' died with <Signals.SIGKILL: 9>.
GPU: Nvidia A6000 48GB VRAM
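The pattern in this report (failure with 11,000 images at the `--resolution=1024` shown in the error, success with fewer images or a lower resolution) is what one would expect if host memory scales with dataset size times image area. As a rough back-of-the-envelope check, the sketch below estimates the RAM needed if every image were decoded into a float32 tensor and held in memory at once. That is an assumption about the cause, not something confirmed in this thread, and `estimate_decoded_bytes` is a hypothetical helper.

```python
def estimate_decoded_bytes(num_images, resolution, channels=3, bytes_per_value=4):
    """Bytes needed to hold every image as a square float32 tensor at the same time."""
    return num_images * channels * resolution * resolution * bytes_per_value


for res in (256, 1024):
    gib = estimate_decoded_bytes(11_000, res) / 2**30
    print(f"resolution={res}: {gib:.1f} GiB")
```

Under that assumption the estimate is about 8.1 GiB at resolution 256 and about 128.9 GiB at resolution 1024, a 16x difference that would fit the observation that lowering the resolution or shrinking the dataset avoids the kill.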