diffusers
Accelerate error when training with train_dreambooth_lora_sdxl_advanced.py
Describe the bug
I hit the following error, with no other diagnostic output, when running 'train_dreambooth_lora_sdxl_advanced.py':
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sdxl_advanced.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=training_images', '--instance_prompt=photo of ohwx man', '--class_prompt=photo of man', '--class_data_dir=man_dataset', '--output_dir=result', '--mixed_precision=fp16', '--resolution=1024', '--num_train_epochs=10', '--with_prior_preservation', '--prior_loss_weight=1.0', '--train_batch_size=1', '--repeats=20', '--gradient_accumulation_steps=1', '--train_text_encoder', '--gradient_checkpointing', '--learning_rate=1e-4', '--text_encoder_lr=5e-5', '--optimizer=adamW', '--num_class_images=3000', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--rank=128', '--seed=0']' died with <Signals.SIGKILL: 9>.
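For readers unfamiliar with the last line of the traceback: "died with <Signals.SIGKILL: 9>" means the launcher's child process was terminated by signal 9 from outside, rather than failing with a Python exception, which is why no further error text appears. The sketch below is only an illustration of how Python's `subprocess` reports that case on POSIX systems; the helper name `describe_returncode` is made up for this example and is not part of accelerate or diffusers.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable message.

    On POSIX, subprocess reports a child killed by signal N as returncode -N.
    (Hypothetical helper for illustration only.)
    """
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known to this platform's signal module.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Spawn a child that sends SIGKILL to itself, mimicking an external kill
    # (POSIX only; SIGKILL does not exist on Windows).
    child = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    )
    print(child.returncode, "->", describe_returncode(child.returncode))
```

A SIGKILL cannot be caught by the training script itself, so the cause has to be looked for outside the process, for example in system memory usage or the kernel log.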
Reproduction
1. Clone https://github.com/huggingface/diffusers.git
2. cd diffusers, then pip install .
3. cd examples/advanced_diffusion_training
4. pip install -r requirements.txt
5. accelerate config default
6. Run the training script using accelerate:
accelerate launch train_dreambooth_lora_sdxl_advanced.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--instance_data_dir="training_images" \
--instance_prompt="photo of ohwx man" \
--class_prompt="photo of man" \
--class_data_dir="man_dataset" \
--output_dir="result" \
--mixed_precision="fp16" \
--resolution=1024 \
--num_train_epochs=10 \
--with_prior_preservation --prior_loss_weight=1.0 \
--train_batch_size=1 \
--repeats=20 \
--gradient_accumulation_steps=1 \
--train_text_encoder \
--gradient_checkpointing \
--learning_rate=1e-4 \
--text_encoder_lr=5e-5 \
--optimizer="adamW" \
--num_class_images=3000 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--rank=128 \
--seed="0"
Logs
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sdxl_advanced.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=training_images', '--instance_prompt=photo of ohwx man', '--class_prompt=photo of man', '--class_data_dir=man_dataset', '--output_dir=result', '--mixed_precision=fp16', '--resolution=1024', '--num_train_epochs=10', '--with_prior_preservation', '--prior_loss_weight=1.0', '--train_batch_size=1', '--repeats=20', '--gradient_accumulation_steps=1', '--train_text_encoder', '--gradient_checkpointing', '--learning_rate=1e-4', '--text_encoder_lr=5e-5', '--optimizer=adamW', '--num_class_images=3000', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--rank=128', '--seed=0']' die
System Info
- accelerate: 0.29.3
- OS: Ubuntu 22.04
- Python version: 3.10.12
- torch version: 2.1.0+cu118
- numpy version: 1.24.1
- GPU: A6000
- accelerate configuration: default
Who can help?
@sayakpaul
Cc: @linoytsaban
Could the process be running out of system RAM (rather than GPU memory)? Can you track how RAM usage grows while you run the command?
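One way to follow that suggestion on Linux is to sample `/proc/meminfo` from a second terminal while training runs. The following is a minimal sketch, not something from the diffusers or accelerate tooling: the function names (`parse_meminfo`, `used_fraction`, `watch`) and the sampling interval are assumptions chosen for illustration, and it relies on the `MemTotal` and `MemAvailable` fields that Linux exposes in that file.

```python
import time
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value}.

    Most fields are reported in kB; the unit suffix is dropped here.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def used_fraction(meminfo: dict[str, int]) -> float:
    """Fraction of system RAM that is not currently available."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total


def watch(interval_s: float = 5.0, samples: int = 120,
          path: str = "/proc/meminfo") -> None:
    """Print a timestamped RAM usage line every interval_s seconds."""
    for _ in range(samples):
        info = parse_meminfo(Path(path).read_text())
        available_gib = info["MemAvailable"] / 1024**2  # kB -> GiB
        print(f"{time.strftime('%H:%M:%S')} "
              f"used={used_fraction(info):.1%} "
              f"available={available_gib:.1f} GiB")
        time.sleep(interval_s)
```

Calling `watch()` in a separate Python session while the training command runs prints one line per sample. If the used fraction climbs toward 100% shortly before the SIGKILL, that would be consistent with a system-RAM OOM, though it would not prove it.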
I'm already using 50 GB of RAM.
This is my config.
There is a SIGKILL. Could you examine the kernel's log:
dmesg --ctime | grep --ignore-case --before-context 1 "killed"
In addition to what @standardAI asked: did you try any other configs, and did any of them work OK? Is there anything else in the logs before the error?
I tested it too. I get this error. @linoytsaban @standardAI
- Platform: Ubuntu 22.04.3 LTS - Linux-5.4.0-169-generic-x86_64-with-glibc2.35
- Running on a notebook?: No
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.23.0
- Transformers version: 4.41.0
- Accelerate version: 0.30.1
- PEFT version: 0.11.2.dev0
- Bitsandbytes version: 0.43.1
- Safetensors version: 0.4.3
- xFormers version: 0.0.24
- Accelerator: NVIDIA GeForce RTX 3090, 24576 MiB VRAM
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Code:
accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path="SG161222/RealVisXL_V4.0" \
--instance_data_dir="train/image" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--output_dir="lora-trained-xl" \
--mixed_precision="fp16" \
--instance_prompt="a photo of try_on a model wearing" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="a photo of try_on a model wearing" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
I added these parameters and tested again. Error persists.
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--mixed_precision="fp16" \
Code:
dmesg --ctime | grep --ignore-case --before-context 1 "killed"
Output: dmesg: read kernel buffer failed: Operation not permitted
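The "Operation not permitted" message means the kernel log could not be read by this user. On Linux this is commonly controlled by the `kernel.dmesg_restrict` sysctl, in which case running the same `dmesg` command with `sudo` usually works. Inside a container the log may be blocked regardless, so treat this as one possible explanation rather than a diagnosis. The sketch below checks that setting; the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional


def dmesg_is_restricted(raw: str) -> bool:
    """Interpret the contents of /proc/sys/kernel/dmesg_restrict.

    "0" means unprivileged users may read the kernel log; any other
    value is treated here as restricted.
    """
    return raw.strip() != "0"


def read_dmesg_restrict(
    path: str = "/proc/sys/kernel/dmesg_restrict",
) -> Optional[str]:
    """Return the raw sysctl value, or None if it cannot be read."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_dmesg_restrict()
    if raw is None:
        print("Could not read kernel.dmesg_restrict (not Linux, or blocked).")
    elif dmesg_is_restricted(raw):
        print("kernel.dmesg_restrict is set; try the dmesg command with sudo.")
    else:
        print("kernel.dmesg_restrict is 0; the failure has another cause.")
```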