diffusers
Training gets stuck in save_state when using the DeepSpeed backend with train_text_to_image_lora
Describe the bug
When using the DeepSpeed backend, training runs fine but gets stuck in accelerator.save_state(save_path). With the MULTI_GPU backend, the process works fine.
The training command is:
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="pretrain_models/stable-diffusion-v1-4/" \
--dataset_name="lambdalabs/pokemon-blip-captions" \
--output_dir="sd-pokemon-model-lora" \
--resolution=512 \
--gradient_accumulation_steps=1 \
--checkpointing_steps=100 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_epochs=50 \
--seed="0" \
--checkpointing_steps 50 \
--train_batch_size=1 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention
Reproduction
MULTI_GPU backend xx/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 1,2,3
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
logs
03/08/2023 21:57:44 - INFO - __main__ - ***** Running training *****
03/08/2023 21:57:44 - INFO - __main__ - Num examples = 833
03/08/2023 21:57:44 - INFO - __main__ - Num Epochs = 2
03/08/2023 21:57:44 - INFO - __main__ - Instantaneous batch size per device = 1
03/08/2023 21:57:44 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 21:57:44 - INFO - __main__ - Gradient Accumulation steps = 1
03/08/2023 21:57:44 - INFO - __main__ - Total optimization steps = 500
Steps: 10%|████████▎ | 50/500 [00:11<01:31, 4.94it/s, lr=0.0001, step_loss=0.00245]03/08/2023 21:57:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-50/pytorch_model.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-50/optimizer.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-50/scheduler.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-50/scaler.pt
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-50/random_states_0.pkl
03/08/2023 21:57:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-50
Steps: 20%|████████████████▌ | 100/500 [00:22<01:21, 4.92it/s, lr=0.0001, step_loss=0.0787]03/08/2023 21:58:06 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-100
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-100/pytorch_model.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-100/optimizer.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-100/scheduler.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-100/scaler.pt
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-100/random_states_0.pkl
03/08/2023 21:58:06 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-100
DeepSpeed backend xx/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
Note that I have commented out self._checkpoint_tag_validation(tag) in runtime/engine.py, otherwise it gets stuck at that point. With that line commented out, the logs are:
03/08/2023 22:06:10 - INFO - __main__ - ***** Running training *****
03/08/2023 22:06:10 - INFO - __main__ - Num examples = 833
03/08/2023 22:06:10 - INFO - __main__ - Num Epochs = 2
03/08/2023 22:06:10 - INFO - __main__ - Instantaneous batch size per device = 1
03/08/2023 22:06:10 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 22:06:10 - INFO - __main__ - Gradient Accumulation steps = 1
03/08/2023 22:06:10 - INFO - __main__ - Total optimization steps = 500
Steps: 10%|████████▎ | 50/500 [00:11<01:36, 4.68it/s, lr=0.0001, step_loss=0.00255]03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2023-03-08 22:06:22,219] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is begin to save!
/home/deepwisdom/anaconda3/envs/wjl/lib/python3.10/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-03-08 22:06:22,222] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt
[2023-03-08 22:06:22,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt...
[2023-03-08 22:06:22,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt.
...
It gets stuck in save_checkpoint in deepspeed/runtime/engine.py:
# save_checkpoint
# https://github.com/microsoft/DeepSpeed/blob/v0.8.1/deepspeed/runtime/engine.py#LL3123C12-L3123C12
if self.save_zero_checkpoint:
    self._create_zero_checkpoint_files(save_dir, tag)
    self._save_zero_checkpoint(save_dir, tag)
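For illustration, a hypothetical, minimal torch.distributed demo (not code from DeepSpeed or this script) of the general failure mode: the save path appears to contain collective calls (such as the tag validation mentioned above), and a collective that only some ranks reach blocks forever.

# Hypothetical demo of a collective reached by only one rank: rank 0 waits in
# dist.barrier() indefinitely because rank 1 never calls it. CPU-only, gloo backend.
import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        print("rank 0: entering barrier (this will hang)")
        dist.barrier()  # never completes: rank 1 does not call barrier()
    else:
        print("rank 1: doing other work, never calling barrier()")
        time.sleep(3600)
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # hangs by design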
Logs
No response
System Info
Ubuntu 20.04, Nvidia GTX 3090, CUDA Version: 11.7, Torch: 1.13.1, Diffusers: 0.15.0.dev0, deepspeed: 0.8.1, xformers: 0.0.17.dev466, accelerate: 0.16.0
Hey @better629,
I'm not sure we have time currently to debug DeepSpeed integrations with diffusers, we'd be more than happy though to review a PR if you manage to fix the bug (or someone from the community)
Do you have a DeepSpeed config that has run train_text_to_image_lora.py successfully before? There are lots of configuration options, so I'm not sure I'm using them correctly.
BTW, I will try to solve it myself.
@patrickvonplaten
I have checked how DeepSpeed uses save_state in its own examples and found that accelerator.save_state does not need to be guarded by if accelerator.is_main_process. So I updated the save code in train_text_to_image_lora.py to drop the is_main_process check, from:
if global_step % args.checkpointing_steps == 0:
    if accelerator.is_main_process:
        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        accelerator.save_state(save_path)
        logger.info(f"Saved state to {save_path}")
to:
if global_step % args.checkpointing_steps == 0:
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    logger.info(f"Start to save state to {save_path}")
    accelerator.save_state(save_path)
    logger.info(f"Saved state to {save_path}")
Training and checkpoint saving then succeed with DeepSpeed ZeRO stage 0 or 2.
It's a little confusing, because I have usually seen the model saved only from the main process in distributed training mode; DeepSpeed behaves differently here.
@patrickvonplaten the DeepSpeed docs on saving training checkpoints state that all processes must call this method, and not just the process with rank 0, so there is no need to check accelerator.is_main_process before saving state.
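A minimal sketch of a save guard along these lines (assuming the usual script variables accelerator, args, global_step and logger, with DistributedType imported from accelerate.utils): it lets every rank call save_state under DeepSpeed while keeping the main-process check for other backends.

# Sketch: every rank calls save_state() under DeepSpeed, otherwise only the
# main process saves. Assumes accelerator, args, global_step and logger exist
# as in train_text_to_image_lora.py.
import os

from accelerate.utils import DistributedType

if global_step % args.checkpointing_steps == 0:
    is_deepspeed = accelerator.distributed_type == DistributedType.DEEPSPEED
    if is_deepspeed or accelerator.is_main_process:
        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        accelerator.save_state(save_path)
        logger.info(f"Saved state to {save_path}")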
@better629 Thank you for your solution. When I comment out the accelerator.is_main_process check, I hit another error in the save_model_hook function. Do you have any ideas?
Traceback (most recent call last):
  File "train_text_to_image_dit.py", line 857, in <module>
    main()
  File "train_text_to_image_dit.py", line 824, in main
    accelerator.save_state(save_path)
  File "/new_share/dengjincan/conda/envs/diffusers2/lib/python3.8/site-packages/accelerate/accelerator.py", line 2026, in save_state
    hook(self._models, weights, output_dir)
  File "train_text_to_image_dit.py", line 494, in save_model_hook
    weights.pop()
IndexError: pop from empty list
cc @patil-suraj since you're using DeepSpeed for OpenMUSE (https://huggingface.co/openMUSE), it would be a nice community addition to improve the diffusers training scripts a bit here afterwards maybe :-)
@JincanDeng I have met the same error, have you solved it?
@patrickvonplaten I also faced the same error and hope its priority can be raised. Currently this blocks the ONNX Runtime Training integration with Diffusers + DeepSpeed.
We would need some help from DeepSpeed folks here I think
(quoting @better629's workaround above: drop the is_main_process check so that every rank calls accelerator.save_state under DeepSpeed)
Hello, I also faced a similar problem, although I am not using LoRA for training. Your method works, thank you very much. But after I modified the script according to your method, the following error appeared (the traceback is printed by each process and interleaved in the console; a debug print shows that weights is []):
Traceback (most recent call last):
  File "/path/to/project/train_text_to_image.py", line 1185, in <module>
    main()
  File "/path/to/project/train_text_to_image.py", line 1085, in main
    save_checkpoint()
  File "/path/to/project/train_text_to_image.py", line 1048, in save_checkpoint
    accelerator.save_state(save_path)
  File "/home/hello/anaconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2417, in save_state
    hook(self._models, weights, output_dir)
  File "/path/to/project/train_text_to_image.py", line 694, in save_model_hook
    weights.pop()
IndexError: pop from empty list
The weights list is empty every time a checkpoint is saved. So should the weights.pop() line be deleted?
(quoting @better629's workaround and the IndexError: pop from empty list report above)
@kzwang001 Have you solved it? I met the same problem as well.
any updates?
I ran into the same IndexError: pop from empty list issue with the provided example script train_dreambooth.py under the ./examples/dreambooth directory in this repository. All I ended up doing to fix it was to add a check that weights is not empty before popping it, and everything ran great after that. So basically in train_dreambooth.py the save_model_hook function went from:
# create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
def save_model_hook(models, weights, output_dir):
    for model in models:
        sub_dir = "unet" if isinstance(model, type(accelerator.unwrap_model(unet))) else "text_encoder"
        model.save_pretrained(os.path.join(output_dir, sub_dir))

        # make sure to pop weight so that corresponding model is not saved again
        weights.pop()
to being:
# create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
def save_model_hook(models, weights, output_dir):
    for model in models:
        sub_dir = "unet" if isinstance(model, type(accelerator.unwrap_model(unet))) else "text_encoder"
        model.save_pretrained(os.path.join(output_dir, sub_dir))

        # make sure to pop weight so that corresponding model is not saved again
        if weights:  # Don't pop if empty
            weights.pop()
Hope this may help anyone else running into the same issue :)
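For context, these hooks only take effect because the script registers them with the Accelerator, roughly like this (a sketch; see the actual script for the matching load_model_hook definition):

# Sketch: wire the custom hooks into accelerate so that accelerator.save_state()
# and accelerator.load_state() call them. Assumes save_model_hook and
# load_model_hook are defined as in examples/dreambooth/train_dreambooth.py.
accelerator.register_save_state_pre_hook(save_model_hook)
accelerator.register_load_state_pre_hook(load_model_hook)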
I guess that, after the accelerator.is_main_process check is commented out, multiple processes execute save_model_hook repeatedly, so executing weights.pop() only under the main process should also work.
(quoting @better629's workaround above)
It helped, thank you very much. But for some reason it only helped on a fresh server; the old one had to be dropped...
diffusers fixed this error in version 0.28.0.dev0. I hit the same error when fine-tuning SDXL with diffusers==0.26.3, but everything works after upgrading diffusers to 0.28.0.dev0.
(quoting @better629's workaround above)
This is still useful when using an older version of diffusers.