Advanced training SD1.5 has an issue when saving checkpoints
Describe the bug
Today I trained with examples/dreambooth/train_dreambooth_lora.py on Google Colab and everything worked fine.
I then tried examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py with the original Stable Diffusion 1.5 model (which I cloned to my HF account), but when the script tries to save a checkpoint, an error is raised.
dataset = 10 images
checkpointing_steps=10 --> ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
A different error appears when I change checkpointing_steps to a number different from the number of images: checkpointing_steps=20 --> NameError: free variable 'pipeline' referenced before assignment in enclosing scope
Note: no validation prompt is set (the logs show "validation prompt: None").
Reproduction
%cd /content
!mkdir /content/cache
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch v0.29.2-patch https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
https://colab.research.google.com/github/josemerinom/test/blob/master/test.ipynb
Logs 1 (checkpointing_steps=10)
06/30/2024 01:08:45 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'prediction_type', 'variance_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'thresholding', 'timestep_spacing', 'rescale_betas_zero_snr', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'use_post_quant_conv', 'force_upcast', 'use_quant_conv', 'latents_std', 'scaling_factor', 'shift_factor', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'num_class_embeds', 'encoder_hid_dim', 'projection_class_embeddings_input_dim', 'time_embedding_act_fn', 'use_linear_projection', 'resnet_skip_time_act', 'mid_block_only_cross_attention', 'dual_cross_attention', 'attention_type', 'time_cond_proj_dim', 'addition_embed_type_num_heads', 'time_embedding_type', 'conv_out_kernel', 'reverse_transformer_layers_per_block', 'class_embeddings_concat', 'resnet_time_scale_shift', 'class_embed_type', 'transformer_layers_per_block', 'encoder_hid_dim_type', 'conv_in_kernel', 'only_cross_attention', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_dim', 'mid_block_type', 'dropout', 'num_attention_heads', 'timestep_post_act', 'upcast_attention'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
06/30/2024 01:09:00 - INFO - __main__ - Num examples = 10
06/30/2024 01:09:00 - INFO - __main__ - Num batches each epoch = 10
06/30/2024 01:09:00 - INFO - __main__ - Num Epochs = 10
06/30/2024 01:09:00 - INFO - __main__ - Instantaneous batch size per device = 1
06/30/2024 01:09:00 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:09:00 - INFO - __main__ - Gradient Accumulation steps = 1
06/30/2024 01:09:00 - INFO - __main__ - Total optimization steps = 100
Steps: 10% 10/100 [00:07<00:50, 1.80it/s, loss=0.00439, lr=0.0001]06/30/2024 01:09:07 - INFO - accelerate.accelerator - Saving current state to /content/drive/MyDrive/train/checkpoint-10
/usr/local/lib/python3.10/dist-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /content/drive/MyDrive/zero/zero15 - will assume that the vocabulary was not modified.
warnings.warn(
Traceback (most recent call last):
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
main(args)
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1791, in main
accelerator.save_state(save_path)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
hook(self._models, weights, output_dir)
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps: 10% 10/100 [00:07<01:11, 1.25it/s, loss=0.00439, lr=0.0001]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
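For context, the traceback above ends in the script's save_model_hook (line 1293). Below is a minimal, self-contained sketch of that pattern, using stand-in classes rather than the real diffusers/transformers models: the hook only recognizes the model classes it was written to expect, so a text encoder that Accelerate prepared but the hook does not handle falls through to the ValueError branch.

# Illustrative sketch only: class names are stand-ins, and the hook body is
# reduced to the control flow that matters for this error.
class UNet2DConditionModel:   # stand-in for the UNet the hook expects
    pass

class CLIPTextModel:          # stand-in for the text encoder that is not handled
    pass

def save_model_hook(models, weights, output_dir):
    for model in models:
        if isinstance(model, UNet2DConditionModel):
            pass  # the real hook would collect the UNet LoRA state dict here
        else:
            # Any prepared model the hook does not recognize ends up here,
            # which is what aborts the checkpoint save.
            raise ValueError(f"unexpected save model: {model.__class__}")
        weights.pop()

# Accelerate passes every prepared model to the hook; the unexpected
# text encoder triggers the error seen at checkpoint-10.
save_model_hook([UNet2DConditionModel(), CLIPTextModel()], [{}, {}], "checkpoint-10")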
Logs 2 (checkpointing_steps=20)
06/30/2024 01:11:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'rescale_betas_zero_snr', 'variance_type', 'sample_max_value', 'thresholding', 'timestep_spacing', 'dynamic_thresholding_ratio', 'clip_sample_range', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean', 'shift_factor', 'scaling_factor', 'force_upcast', 'use_quant_conv', 'use_post_quant_conv'} was not found in config. Values will be initialized to default values.
{'encoder_hid_dim', 'dropout', 'attention_type', 'resnet_out_scale_factor', 'time_embedding_type', 'conv_out_kernel', 'mid_block_only_cross_attention', 'transformer_layers_per_block', 'addition_embed_type_num_heads', 'num_attention_heads', 'only_cross_attention', 'num_class_embeds', 'time_embedding_act_fn', 'mid_block_type', 'addition_time_embed_dim', 'encoder_hid_dim_type', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'upcast_attention', 'resnet_skip_time_act', 'use_linear_projection', 'class_embeddings_concat', 'time_embedding_dim', 'addition_embed_type', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'timestep_post_act', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'time_cond_proj_dim'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:11:15 - INFO - __main__ - ***** Running training *****
06/30/2024 01:11:15 - INFO - __main__ - Num examples = 10
06/30/2024 01:11:15 - INFO - __main__ - Num batches each epoch = 10
06/30/2024 01:11:15 - INFO - __main__ - Num Epochs = 10
06/30/2024 01:11:15 - INFO - __main__ - Instantaneous batch size per device = 1
06/30/2024 01:11:15 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:11:15 - INFO - __main__ - Gradient Accumulation steps = 1
06/30/2024 01:11:15 - INFO - __main__ - Total optimization steps = 100
Steps: 10% 10/100 [00:08<00:51, 1.74it/s, loss=0.125, lr=0.0001] Traceback (most recent call last):
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
main(args)
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1854, in main
images = [
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1855, in <listcomp>
pipeline(**pipeline_args, generator=generator).images[0]
NameError: free variable 'pipeline' referenced before assignment in enclosing scope
Steps: 10% 10/100 [00:08<01:18, 1.14it/s, loss=0.125, lr=0.0001]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
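For context, this second traceback looks like the classic "variable only assigned inside a skipped branch" pattern: pipeline is presumably only created when a validation prompt is set, but the list comprehension that generates validation images still closes over it. A minimal sketch of that pattern, with illustrative names rather than the script's actual structure, reproduces the same NameError on Python 3.10:

# Illustrative sketch: a closure references a variable whose only assignment
# sits in a branch that never ran (here, because no validation prompt was set).
validation_prompt = None  # matches "validation prompt: None" in the logs

def main():
    if validation_prompt is not None:
        pipeline = object()  # the real script would build the inference pipeline here

    # A list comprehension has its own scope, so `pipeline` is a free variable
    # of the enclosing main(); it was never assigned because the branch above
    # was skipped.
    images = [pipeline for _ in range(1)]
    return images

main()  # NameError: free variable 'pipeline' referenced before assignment in enclosing scope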
System Info
- 🤗 Diffusers version: 0.29.2
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Running on a notebook?: No
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
- Jax version: 0.4.26
- JaxLib version: 0.4.26
- Huggingface_hub version: 0.23.4
- Transformers version: 4.42.3
- Accelerate version: 0.31.0
- PEFT version: 0.11.1
- Bitsandbytes version: not installed
- Safetensors version: 0.4.3
- xFormers version: not installed
- Accelerator: Tesla T4, 15360 MiB VRAM
- Using GPU in script?:
- Using distributed or parallel set-up in script?: