train_text_to_image_sdxl.py fails to resume from a checkpoint, and the checkpoint cannot be loaded for inference

Open XuekuanWang opened this issue 1 year ago • 1 comments

Describe the bug

I am trying to fine-tune an SDXL model, but I have run into some problems.

I cannot resume training from a checkpoint. The error is as follows:

[rank0]:     load_checkpoint_in_model(
[rank0]:   File "/mnt/wangxuekuan/miniconda3/envs/sdxl/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 1637, in load_checkpoint_in_model
[rank0]:     raise ValueError(
[rank0]: ValueError: /mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000 is not a folder containing a .index.json file or a pytorch_model.bin or a model.safetensors file

Here is the model path:

(sdxl) wangxuekuan@ucloud-9:/mnt/wangxuekuan/diffusers/examples/text_to_image$ ls /mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000/unet/
config.json diffusion_pytorch_model-00002-of-00002.safetensors diffusion_pytorch_model-00001-of-00002.safetensors diffusion_pytorch_model.safetensors.index.json
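The error text names the `checkpoint-10000` folder itself, while the listing shows the weight files under its `unet/` subfolder. A minimal sketch for checking where the weight files actually sit is given here; the helper name `describe_checkpoint_dir` is hypothetical, and the file names it looks for are taken only from the ValueError message, not from accelerate's source.

```python
import os

# File names copied from the ValueError text above (an assumption about what
# the loader looks for, not a copy of accelerate's own logic).
EXPECTED_SINGLE_FILES = ("pytorch_model.bin", "model.safetensors")


def describe_checkpoint_dir(path):
    """Report which weight or index files sit directly in `path` and in each subfolder."""

    def weight_files(folder):
        names = sorted(os.listdir(folder))
        return [
            n for n in names
            if n in EXPECTED_SINGLE_FILES or n.endswith(".index.json")
        ]

    report = {"top_level": weight_files(path), "subfolders": {}}
    for name in sorted(os.listdir(path)):
        sub = os.path.join(path, name)
        if os.path.isdir(sub):
            found = weight_files(sub)
            if found:
                report["subfolders"][name] = found
    return report
```

If `top_level` comes back empty and only `subfolders["unet"]` is populated, the layout matches what the listing above shows: the index file exists, but one level below the path named in the error.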

I also want to test the checkpoint, but loading it for inference fails as well:

unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
pipe = DiffusionPipeline.from_pretrained(model_path, unet=unet, safety_checker=None)

This raises the same error.

Reproduction

Inference code:

unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
pipe = DiffusionPipeline.from_pretrained(model_path, unet=unet, safety_checker=None)
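One pattern that may be worth trying, sketched under assumptions: load only the UNet from the checkpoint's `unet/` subfolder and take every other component from the base model named in the training command. The function name `load_sdxl_with_checkpoint_unet` is hypothetical, and the choice of `torch.float16` is an assumption that mirrors `--mixed_precision="fp16"`. Whether the sharded UNet loads depends on the installed diffusers version, so this is not a confirmed fix for the error above.

```python
import os


def load_sdxl_with_checkpoint_unet(
    checkpoint_dir,
    base_model="stabilityai/stable-diffusion-xl-base-1.0",
):
    """Build an SDXL pipeline whose UNet comes from a training checkpoint."""
    unet_dir = os.path.join(checkpoint_dir, "unet")
    if not os.path.isdir(unet_dir):
        raise FileNotFoundError(f"no unet/ subfolder in {checkpoint_dir}")

    # Imported lazily so the path check above runs without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel

    # Only the UNet is read from the checkpoint folder.
    unet = UNet2DConditionModel.from_pretrained(
        checkpoint_dir, subfolder="unet", torch_dtype=torch.float16
    )
    # The VAE, text encoders, tokenizers and scheduler come from the base model.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        base_model, unet=unet, torch_dtype=torch.float16
    )
    return pipe
```

For the paths in this report, the call would be `load_sdxl_with_checkpoint_unet("/mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000")`. The difference from the snippet above is that the pipeline itself is not loaded from the checkpoint path.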

Training script:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/naruto-blip-captions"
export OUTPUT_DIR="/mnt/wangxuekuan/finetune/all/sdxl-exp0"
export RESUME_FROM_CHECKPOINT="/mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000"
export DATASET_NAME="selected_16" #"/mnt/xys/dataset/character/all_in_one_0419/"

accelerate launch train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --train_data_dir=$DATASET_NAME --caption_column="text" \
  --resume_from_checkpoint=$RESUME_FROM_CHECKPOINT \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=1000000 \
  --use_8bit_adam \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=50 \
  --output_dir=$OUTPUT_DIR \
  --push_to_hub

Logs

No response

System Info

Python 3.8, diffusers 0.30, A100-80G

Who can help?

No response

XuekuanWang avatar Jun 21 '24 12:06 XuekuanWang

If you pull in the latest changes of the repository in your local fork, you should be able to perform inference with the code snippet you mentioned. If not, please provide the error trace.

sayakpaul avatar Jun 21 '24 13:06 sayakpaul

I encountered the same problem. Is there any solution? Thank you very much.

LJQbiu avatar Jul 18 '24 14:07 LJQbiu

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 14 '24 15:09 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 13 '24 15:12 github-actions[bot]