diffusers Advanced training SD1.5 has an issue when saving checkpoints

Open josemerinom opened this issue 7 months ago • 5 comments

Describe the bug

Today I trained with examples/dreambooth/train_dreambooth_lora.py in Google Colab, and everything worked.

I then tried examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py with the original Stable Diffusion 1.5 model (which I cloned to my HF account), but an error is raised when the script tries to save a checkpoint.

dataset = 10 images

checkpointing_steps=10 --> ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>

A different error appears when I set checkpointing_steps to a value other than the number of images: checkpointing_steps=20 --> NameError: free variable 'pipeline' referenced before assignment in enclosing scope

validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
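Both tracebacks below appear at step 10 of 100, which is the end of the first epoch according to the logged values (10 examples, batch size 1, no gradient accumulation, 10 epochs). The sketch below only redoes that arithmetic. It assumes a checkpoint is written whenever the global step is a multiple of checkpointing_steps; the helper name first_checkpoint_step is made up for illustration.

```python
# Values copied from the training logs in this report.
num_examples = 10
batch_size = 1
grad_accum = 1
num_epochs = 10

steps_per_epoch = num_examples // (batch_size * grad_accum)
total_steps = steps_per_epoch * num_epochs


def first_checkpoint_step(checkpointing_steps):
    """Return the first global step that is a multiple of checkpointing_steps.

    Assumption: checkpoints are written when step % checkpointing_steps == 0.
    """
    return next(
        step for step in range(1, total_steps + 1)
        if step % checkpointing_steps == 0
    )


for checkpointing_steps in (10, 20):
    first = first_checkpoint_step(checkpointing_steps)
    # True means the first checkpoint falls inside the first epoch.
    print(checkpointing_steps, first, first <= steps_per_epoch)
```

Under that assumption, checkpointing_steps=10 triggers a save exactly at the end of the first epoch, while checkpointing_steps=20 does not, so the run reaches whatever code follows the end of the epoch instead.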

Reproduction

%cd /content
!mkdir /content/cache
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch v0.29.2-patch https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training

https://colab.research.google.com/github/josemerinom/test/blob/master/test.ipynb

Logs 1 (checkpointing_steps=10)

06/30/2024 01:08:45 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'prediction_type', 'variance_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'thresholding', 'timestep_spacing', 'rescale_betas_zero_snr', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'use_post_quant_conv', 'force_upcast', 'use_quant_conv', 'latents_std', 'scaling_factor', 'shift_factor', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'num_class_embeds', 'encoder_hid_dim', 'projection_class_embeddings_input_dim', 'time_embedding_act_fn', 'use_linear_projection', 'resnet_skip_time_act', 'mid_block_only_cross_attention', 'dual_cross_attention', 'attention_type', 'time_cond_proj_dim', 'addition_embed_type_num_heads', 'time_embedding_type', 'conv_out_kernel', 'reverse_transformer_layers_per_block', 'class_embeddings_concat', 'resnet_time_scale_shift', 'class_embed_type', 'transformer_layers_per_block', 'encoder_hid_dim_type', 'conv_in_kernel', 'only_cross_attention', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_dim', 'mid_block_type', 'dropout', 'num_attention_heads', 'timestep_post_act', 'upcast_attention'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
06/30/2024 01:09:00 - INFO - __main__ -   Num examples = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:09:00 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:09:00 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:07<00:50,  1.80it/s, loss=0.00439, lr=0.0001]06/30/2024 01:09:07 - INFO - accelerate.accelerator - Saving current state to /content/drive/MyDrive/train/checkpoint-10
/usr/local/lib/python3.10/dist-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /content/drive/MyDrive/zero/zero15 - will assume that the vocabulary was not modified.
  warnings.warn(
Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1791, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
    hook(self._models, weights, output_dir)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
    raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps:  10% 10/100 [00:07<01:11,  1.25it/s, loss=0.00439, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
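
The ValueError above is raised inside save_model_hook, which accelerate calls from save_state. The following is a minimal sketch of how this kind of failure can arise, assuming the hook accepts only a fixed set of model types. UNetLike, TextEncoderLike and the expected_types parameter are hypothetical stand-ins, not the real diffusers or transformers classes, and this is not the script's actual code.

```python
# Hypothetical stand-ins for the real model classes.
class UNetLike:
    pass


class TextEncoderLike:
    pass


def save_model_hook(models, expected_types):
    """Collect one entry per model and reject any type the hook does not know."""
    saved = {}
    for model in models:
        if isinstance(model, expected_types):
            saved[type(model).__name__] = "state dict placeholder"
        else:
            # Same style of failure as in the traceback above.
            raise ValueError(f"unexpected save model: {model.__class__}")
    return saved


# The hook only recognises the UNet stand-in, but a text encoder is passed too.
try:
    save_model_hook([UNetLike(), TextEncoderLike()], expected_types=(UNetLike,))
except ValueError as err:
    message = str(err)
    print(message)
```

If the real hook works along these lines, the question is why the CLIPTextModel instance handed to it does not match any type it checks for.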

Logs 2 (checkpointing_steps=20)

06/30/2024 01:11:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'rescale_betas_zero_snr', 'variance_type', 'sample_max_value', 'thresholding', 'timestep_spacing', 'dynamic_thresholding_ratio', 'clip_sample_range', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean', 'shift_factor', 'scaling_factor', 'force_upcast', 'use_quant_conv', 'use_post_quant_conv'} was not found in config. Values will be initialized to default values.
{'encoder_hid_dim', 'dropout', 'attention_type', 'resnet_out_scale_factor', 'time_embedding_type', 'conv_out_kernel', 'mid_block_only_cross_attention', 'transformer_layers_per_block', 'addition_embed_type_num_heads', 'num_attention_heads', 'only_cross_attention', 'num_class_embeds', 'time_embedding_act_fn', 'mid_block_type', 'addition_time_embed_dim', 'encoder_hid_dim_type', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'upcast_attention', 'resnet_skip_time_act', 'use_linear_projection', 'class_embeddings_concat', 'time_embedding_dim', 'addition_embed_type', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'timestep_post_act', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'time_cond_proj_dim'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:11:15 - INFO - __main__ - ***** Running training *****
06/30/2024 01:11:15 - INFO - __main__ -   Num examples = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:11:15 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:11:15 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:08<00:51,  1.74it/s, loss=0.125, lr=0.0001]  Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1854, in main
    images = [
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1855, in <listcomp>
    pipeline(**pipeline_args, generator=generator).images[0]
NameError: free variable 'pipeline' referenced before assignment in enclosing scope
Steps:  10% 10/100 [00:08<01:18,  1.14it/s, loss=0.125, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
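
The NameError above comes from a list comprehension that calls pipeline. Here is a minimal sketch of that class of error, assuming pipeline is only assigned on a branch that is skipped when there is no validation prompt (the log shows "validation prompt: None"). The function run_validation and its control flow are illustrative guesses, not the script's code.

```python
def run_validation(validation_prompt, num_images=2):
    # Assumed control flow: `pipeline` is bound only when a prompt is given.
    if validation_prompt is not None:
        def pipeline(prompt):
            return f"image for {prompt}"  # placeholder for a real pipeline call

    # The comprehension reads `pipeline` from the enclosing function scope,
    # so it fails if the branch above was skipped.
    return [pipeline(validation_prompt) for _ in range(num_images)]


try:
    run_validation(None)
except NameError as err:
    # On Python 3.10 this is a plain NameError about a free variable;
    # newer versions may raise UnboundLocalError, a NameError subclass.
    error_name = type(err).__name__
    print(error_name, err)
```

A guard that binds pipeline before the comprehension, or skips image generation when it was never created, would avoid the error in this sketch.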

System Info

🤗 Diffusers version: 0.29.2
Platform: Linux-6.1.85+-x86_64-with-glibc2.35
Running on a notebook?: No
Running on Google Colab?: No
Python version: 3.10.12
PyTorch version (GPU?): 2.3.0+cu121 (True)
Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
Jax version: 0.4.26
JaxLib version: 0.4.26
Huggingface_hub version: 0.23.4
Transformers version: 4.42.3
Accelerate version: 0.31.0
PEFT version: 0.11.1
Bitsandbytes version: not installed
Safetensors version: 0.4.3
xFormers version: not installed
Accelerator: Tesla T4, 15360 MiB VRAM
Using GPU in script?:
Using distributed or parallel set-up in script?:

Jun 29 '24 01:06 josemerinom
