
Gets stuck in save_state when using the DeepSpeed backend with train_text_to_image_lora

better629 opened this issue on Mar 08 '23 · 16 comments

Describe the bug

When using the DeepSpeed backend, training works fine but the process gets stuck in accelerator.save_state(save_path). With the MULTI_GPU backend, everything works as expected.

The training command is:

accelerate launch train_text_to_image_lora.py \
    --pretrained_model_name_or_path="pretrain_models/stable-diffusion-v1-4/"  \
    --dataset_name="lambdalabs/pokemon-blip-captions"  \
    --output_dir="sd-pokemon-model-lora" \
    --resolution=512 \
    --gradient_accumulation_steps=1 \
    --checkpointing_steps=100 \
    --learning_rate=1e-4 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --max_train_steps=500 \
    --validation_epochs=50 \
    --seed="0" \
    --checkpointing_steps 50 \
    --train_batch_size=1 \
    --use_8bit_adam \
    --enable_xformers_memory_efficient_attention

Reproduction

MULTI_GPU backend xx/accelerate/default_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 1,2,3
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

Logs:

03/08/2023 21:57:44 - INFO - __main__ - ***** Running training *****
03/08/2023 21:57:44 - INFO - __main__ -   Num examples = 833
03/08/2023 21:57:44 - INFO - __main__ -   Num Epochs = 2
03/08/2023 21:57:44 - INFO - __main__ -   Instantaneous batch size per device = 1
03/08/2023 21:57:44 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 21:57:44 - INFO - __main__ -   Gradient Accumulation steps = 1
03/08/2023 21:57:44 - INFO - __main__ -   Total optimization steps = 500
Steps:  10%|████████▎                                                                          | 50/500 [00:11<01:31,  4.94it/s, lr=0.0001, step_loss=0.00245]03/08/2023 21:57:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-50/pytorch_model.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-50/optimizer.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-50/scheduler.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-50/scaler.pt
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-50/random_states_0.pkl
03/08/2023 21:57:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-50
Steps:  20%|████████████████▌                                                                  | 100/500 [00:22<01:21,  4.92it/s, lr=0.0001, step_loss=0.0787]03/08/2023 21:58:06 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-100
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-100/pytorch_model.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-100/optimizer.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-100/scheduler.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-100/scaler.pt
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-100/random_states_0.pkl
03/08/2023 21:58:06 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-100

DeepSpeed backend xx/accelerate/default_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
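
For reference, a minimal, hypothetical check (not part of the original report) to confirm which backend Accelerate actually resolves from this config at runtime:

# check_backend.py (hypothetical) -- run via: accelerate launch check_backend.py
# Prints which distributed backend Accelerate picked up from the active config.
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.distributed_type)          # e.g. DistributedType.DEEPSPEED or MULTI_GPU
print(accelerator.state.deepspeed_plugin)    # populated only when DeepSpeed is active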

Note: I had to comment out self._checkpoint_tag_validation(tag) in runtime/engine.py, otherwise it gets stuck at that point. With that line commented out, the logs are:

03/08/2023 22:06:10 - INFO - __main__ - ***** Running training *****
03/08/2023 22:06:10 - INFO - __main__ -   Num examples = 833
03/08/2023 22:06:10 - INFO - __main__ -   Num Epochs = 2
03/08/2023 22:06:10 - INFO - __main__ -   Instantaneous batch size per device = 1
03/08/2023 22:06:10 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 22:06:10 - INFO - __main__ -   Gradient Accumulation steps = 1
03/08/2023 22:06:10 - INFO - __main__ -   Total optimization steps = 500
Steps:  10%|████████▎                                                                          | 50/500 [00:11<01:36,  4.68it/s, lr=0.0001, step_loss=0.00255]03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2023-03-08 22:06:22,219] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is begin to save!
/home/deepwisdom/anaconda3/envs/wjl/lib/python3.10/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-03-08 22:06:22,222] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt
[2023-03-08 22:06:22,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt...
[2023-03-08 22:06:22,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt.
...

It then gets stuck in deepspeed/runtime/engine.py:

# save_checkpoint
# https://github.com/microsoft/DeepSpeed/blob/v0.8.1/deepspeed/runtime/engine.py#LL3123C12-L3123C12

        if self.save_zero_checkpoint:
            self._create_zero_checkpoint_files(save_dir, tag)
            self._save_zero_checkpoint(save_dir, tag)
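
This matches the DeepSpeed documentation referenced later in this thread: save_checkpoint synchronizes across ranks, so if only rank 0 calls it, it waits forever for the others. A minimal sketch (illustrative only, not DeepSpeed code) of that deadlock pattern:

# Launch with: torchrun --nproc_per_node=2 deadlock_demo.py  (hypothetical file name)
# Illustrates why a collective call guarded by a rank check hangs: the barrier
# only returns once every rank has reached it.
import torch.distributed as dist

def save_checkpoint_like():
    # Stand-in for engine.save_checkpoint(): the real method synchronizes ranks
    # while saving ZeRO partitions, so every rank must call it.
    dist.barrier()
    if dist.get_rank() == 0:
        print("rank 0 would write the consolidated checkpoint files here")

if __name__ == "__main__":
    dist.init_process_group("gloo")

    # Buggy pattern (mirrors guarding accelerator.save_state with is_main_process):
    #   if dist.get_rank() == 0:
    #       save_checkpoint_like()   # hangs: other ranks never reach the barrier
    # Fixed pattern (mirrors calling save_state on every rank):
    save_checkpoint_like()

    dist.destroy_process_group()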

System Info

Ubuntu 20.04, NVIDIA RTX 3090, CUDA 11.7, torch 1.13.1, diffusers 0.15.0.dev0, deepspeed 0.8.1, xformers 0.0.17.dev466, accelerate 0.16.0

better629 · Mar 08 '23

Hey @better629,

I'm not sure we currently have time to debug DeepSpeed integrations with diffusers, but we'd be more than happy to review a PR if you (or someone from the community) manage to fix the bug.

patrickvonplaten · Mar 09 '23

Hey @better629,

I'm not sure we currently have time to debug DeepSpeed integrations with diffusers, but we'd be more than happy to review a PR if you (or someone from the community) manage to fix the bug.

Do you have a DeepSpeed config that has successfully run train_text_to_image_lora.py before? There are lots of configuration options, so I'm not sure if I'm using it correctly. BTW, I will try to solve it myself.

better629 · Mar 09 '23

@patrickvonplaten I have checked how DeepSpeed uses save_state in its examples, and found that accelerator.save_state does not need to be guarded by if accelerator.is_main_process.
So I updated the save code in train_text_to_image_lora.py to remove the is_main_process check, from

                if global_step % args.checkpointing_steps == 0:
                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        accelerator.save_state(save_path)
                        logger.info(f"Saved state to {save_path}")

to

                if global_step % args.checkpointing_steps == 0:
                    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                    logger.info(f"Start to save state to {save_path}")
                    accelerator.save_state(save_path)
                    logger.info(f"Saved state to {save_path}")

After that, training and saving succeed with DeepSpeed ZeRO stage 0 or 2.

It's a little confusing, since in distributed training I have usually seen the model saved only on the main process; DeepSpeed behaves a bit differently.

better629 · Mar 10 '23

@patrickvonplaten The DeepSpeed documentation on saving training checkpoints mentions that all processes must call this method, not just the process with rank 0. So there is no need to check accelerator.is_main_process before saving state.
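
Putting both points together, a rough sketch of a checkpointing block that should behave under both backends (keeping the guard only around the log call is an assumption on top of the original fix):

if global_step % args.checkpointing_steps == 0:
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    # Every rank must call save_state: under DeepSpeed it triggers collective
    # checkpointing, and skipping it on non-main ranks is what caused the hang.
    accelerator.save_state(save_path)
    if accelerator.is_main_process:
        # Only the logging needs to stay on the main process.
        logger.info(f"Saved state to {save_path}")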

better629 · Mar 15 '23

@better629 Thank you for your solution. When I comment out accelerator.is_main_process, I hit another error in the save_model_hook function. Do you have any ideas?

Traceback (most recent call last):
  File "train_text_to_image_dit.py", line 857, in <module>
    main()
  File "train_text_to_image_dit.py", line 824, in main
    accelerator.save_state(save_path)
  File "/new_share/dengjincan/conda/envs/diffusers2/lib/python3.8/site-packages/accelerate/accelerator.py", line 2026, in save_state
    hook(self._models, weights, output_dir)
  File "train_text_to_image_dit.py", line 494, in save_model_hook
    weights.pop()
IndexError: pop from empty list

JincanDeng · Apr 14 '23

cc @patil-suraj since you're using DeepSpeed for OpenMUSE (https://huggingface.co/openMUSE), it would be a nice community addition to improve the diffusers training scripts a bit here afterwards maybe :-)

patrickvonplaten · Apr 20 '23

@JincanDeng I have met the same error. Have you solved it?

garychan22 · Jun 14 '23

@patrickvonplaten I also faced the same error and hope this can be raised in priority. Currently it blocks the ONNX Runtime Training integration with Diffusers + DeepSpeed.

prathikr · Jun 16 '23

We would need some help from the DeepSpeed folks here, I think.

patrickvonplaten · Jun 16 '23

@patrickvonplaten I have checked how DeepSpeed uses save_state in its examples, and found that accelerator.save_state does not need to be guarded by if accelerator.is_main_process. So I updated the save code in train_text_to_image_lora.py to remove the is_main_process check, from

                if global_step % args.checkpointing_steps == 0:
                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        accelerator.save_state(save_path)
                        logger.info(f"Saved state to {save_path}")

to

                if global_step % args.checkpointing_steps == 0:
                    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                    logger.info(f"Start to save state to {save_path}")
                    accelerator.save_state(save_path)
                    logger.info(f"Saved state to {save_path}")

After that, training and saving succeed with DeepSpeed ZeRO stage 0 or 2.

It's a little confusing, since in distributed training I have usually seen the model saved only on the main process; DeepSpeed behaves a bit differently.

Hello, I also faced a similar problem, though I did not use LoRA for training. Your method works, thank you very much. However, after I modified the script according to your method, the following error appeared:

Traceback (most recent call last):
  File "/path/to/project/train_text_to_image.py", line 1185, in <module>
DEBUG []
    main()
  File "/path/to/project/train_text_to_image.py", line 1085, in main
Traceback (most recent call last):
    save_checkpoint()
  File "/path/to/project/train_text_to_image.py", line 1048, in save_checkpoint
  File "/path/to/project/train_text_to_image.py", line 1185, in <module>
    main()
  File "/path/to/project/train_text_to_image.py", line 1085, in main
    save_checkpoint()
  File "/path/to/project/train_text_to_image.py", line 1048, in save_checkpoint
    accelerator.save_state(save_path)
  File "/home/hello/anaconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2417, in save_state
    hook(self._models, weights, output_dir)
    accelerator.save_state(save_path)
  File "/home/hello/anaconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2417, in save_state
  File "/path/to/project/train_text_to_image.py", line 694, in save_model_hook
    weights.pop()
IndexError: pop from empty list
    hook(self._models, weights, output_dir)
  File "/path/to/project/train_text_to_image.py", line 694, in save_model_hook
    weights.pop()
IndexError: pop from empty list

The weights list is empty every time a checkpoint is saved. So should the weights.pop() line be deleted?

kzwang001 · Jul 20 '23

@patrickvonplaten I have checked how DeepSpeed uses save_state in its examples, and found that accelerator.save_state does not need to be guarded by if accelerator.is_main_process. So I updated the save code in train_text_to_image_lora.py to remove the is_main_process check, from

                if global_step % args.checkpointing_steps == 0:
                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        accelerator.save_state(save_path)
                        logger.info(f"Saved state to {save_path}")

to

                if global_step % args.checkpointing_steps == 0:
                    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                    logger.info(f"Start to save state to {save_path}")
                    accelerator.save_state(save_path)
                    logger.info(f"Saved state to {save_path}")

After that, training and saving succeed with DeepSpeed ZeRO stage 0 or 2. It's a little confusing, since in distributed training I have usually seen the model saved only on the main process; DeepSpeed behaves a bit differently.

Hello, I also faced a similar problem, though I did not use LoRA for training. Your method works, thank you very much. However, after I modified the script according to your method, the following error appeared:

Traceback (most recent call last):
  File "/path/to/project/train_text_to_image.py", line 1185, in <module>
DEBUG []
    main()
  File "/path/to/project/train_text_to_image.py", line 1085, in main
Traceback (most recent call last):
    save_checkpoint()
  File "/path/to/project/train_text_to_image.py", line 1048, in save_checkpoint
  File "/path/to/project/train_text_to_image.py", line 1185, in <module>
    main()
  File "/path/to/project/train_text_to_image.py", line 1085, in main
    save_checkpoint()
  File "/path/to/project/train_text_to_image.py", line 1048, in save_checkpoint
    accelerator.save_state(save_path)
  File "/home/hello/anaconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2417, in save_state
    hook(self._models, weights, output_dir)
    accelerator.save_state(save_path)
  File "/home/hello/anaconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2417, in save_state
  File "/path/to/project/train_text_to_image.py", line 694, in save_model_hook
    weights.pop()
IndexError: pop from empty list
    hook(self._models, weights, output_dir)
  File "/path/to/project/train_text_to_image.py", line 694, in save_model_hook
    weights.pop()
IndexError: pop from empty list

The weights list is empty every time a checkpoint is saved. So should the weights.pop() line be deleted?

@kzwang001 Have you solved it? I met the same problem as well.

deropty · Jul 23 '23

any updates?

Yangr116 · Aug 14 '23

I ran into the same IndexError: pop from empty list issue with the provided example script train_dreambooth.py under the ./examples/dreambooth directory in this repository. All I ended up doing to fix it was to add a check that weights is NOT empty before popping it, and everything ran great after that. So basically, in train_dreambooth.py the save_model_hook function went from being:

    # create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
    def save_model_hook(models, weights, output_dir):
        for model in models:
            sub_dir = "unet" if isinstance(model, type(accelerator.unwrap_model(unet))) else "text_encoder"
            model.save_pretrained(os.path.join(output_dir, sub_dir))
    
            # make sure to pop weight so that corresponding model is not saved again
            weights.pop()

to being:

    # create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
    def save_model_hook(models, weights, output_dir):
        for model in models:
            sub_dir = "unet" if isinstance(model, type(accelerator.unwrap_model(unet))) else "text_encoder"
            model.save_pretrained(os.path.join(output_dir, sub_dir))
    
            # make sure to pop weight so that corresponding model is not saved again
            if weights: # Don't pop if empty
                weights.pop()

Hope this may help anyone else running into the same issue :)

tedliosu · Sep 03 '23

I guess that, once the accelerator.is_main_process check is commented out, multiple processes execute save_model_hook, not just the main one. So executing weights.pop() only on the main process should work.
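
A rough, untested sketch of that idea applied to the hook from train_dreambooth.py shown above (the empty-list check is kept as a safety net; whether to also guard save_pretrained is left as in the original):

    def save_model_hook(models, weights, output_dir):
        for model in models:
            sub_dir = "unet" if isinstance(model, type(accelerator.unwrap_model(unet))) else "text_encoder"
            model.save_pretrained(os.path.join(output_dir, sub_dir))

            # Pop only on the main process, and only if accelerate actually passed
            # weights in (under DeepSpeed the list arrives empty, see the DEBUG [] above).
            if accelerator.is_main_process and weights:
                weights.pop()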

mrwu-mac · Sep 13 '23

@patrickvonplaten I have checked how DeepSpeed uses save_state in its examples, and found that accelerator.save_state does not need to be guarded by if accelerator.is_main_process. So I updated the save code in train_text_to_image_lora.py to remove the is_main_process check, from

                if global_step % args.checkpointing_steps == 0:
                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        accelerator.save_state(save_path)
                        logger.info(f"Saved state to {save_path}")

to

                if global_step % args.checkpointing_steps == 0:
                    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                    logger.info(f"Start to save state to {save_path}")
                    accelerator.save_state(save_path)
                    logger.info(f"Saved state to {save_path}")

After that, training and saving succeed with DeepSpeed ZeRO stage 0 or 2.

It's a little confusing, since in distributed training I have usually seen the model saved only on the main process; DeepSpeed behaves a bit differently.

It helped, thank you very much. But for some reason it only helped on a fresh server; the old one had to be dropped...

kopyl · Dec 11 '23

diffusers fixed this error in version "0.28.0.dev0". I hit the same error when fine-tuning SDXL with diffusers==0.26.3, but everything works after upgrading diffusers to version "0.28.0.dev0".
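
A quick, hypothetical way to check whether an installed environment already has that version before applying the workaround above:

# version_check.py (hypothetical): decide whether the is_main_process workaround
# is still needed, based on the installed diffusers version.
import diffusers
from packaging import version

if version.parse(diffusers.__version__) >= version.parse("0.28.0.dev0"):
    print(f"diffusers {diffusers.__version__}: should already include the fix")
else:
    print(f"diffusers {diffusers.__version__}: the workaround above may still be needed")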

MarinXue · May 15 '24

@patrickvonplaten I have checked how DeepSpeed uses save_state in its examples, and found that accelerator.save_state does not need to be guarded by if accelerator.is_main_process. So I updated the save code in train_text_to_image_lora.py to remove the is_main_process check, from

                if global_step % args.checkpointing_steps == 0:
                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        accelerator.save_state(save_path)
                        logger.info(f"Saved state to {save_path}")

to

                if global_step % args.checkpointing_steps == 0:
                    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                    logger.info(f"Start to save state to {save_path}")
                    accelerator.save_state(save_path)
                    logger.info(f"Saved state to {save_path}")

After that, training and saving succeed with DeepSpeed ZeRO stage 0 or 2.

It's a little confusing, since in distributed training I have usually seen the model saved only on the main process; DeepSpeed behaves a bit differently.

This is still useful when using older versions of diffusers.

ZacGuanEr · Sep 22 '24