Describe the bug

I am trying to fine-tune an SDXL model, but I ran into some problems.

I cannot resume training from a checkpoint. The error is as follows:

[rank0]: load_checkpoint_in_model(
[rank0]: File "/mnt/wangxuekuan/miniconda3/envs/sdxl/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 1637, in load_checkpoint_in_model
[rank0]: raise ValueError(
[rank0]: ValueError: /mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000 is not a folder containing a .index.json file or a pytorch_model.bin or a model.safetensors file

Here is the model path:

(sdxl) wangxuekuan@ucloud-9:/mnt/wangxuekuan/diffusers/examples/text_to_image$ ls /mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000/unet/
config.json
diffusion_pytorch_model-00001-of-00002.safetensors
diffusion_pytorch_model-00002-of-00002.safetensors
diffusion_pytorch_model.safetensors.index.json
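The error message and the listing above can be compared with a small check. This is only a sketch that mirrors the wording of the error message (a `.index.json` file, `pytorch_model.bin`, or `model.safetensors` directly inside the folder); it is not accelerate's actual implementation, and the helper name `find_loadable_weights` is made up for illustration.

```python
import os

# File names taken from the wording of the ValueError above.
SINGLE_FILE_NAMES = ("pytorch_model.bin", "model.safetensors")


def find_loadable_weights(folder):
    """Return top-level files in `folder` that match the error message's criteria."""
    matches = []
    for name in sorted(os.listdir(folder)):
        # Only regular files directly inside `folder` count; subfolders are not searched.
        if not os.path.isfile(os.path.join(folder, name)):
            continue
        if name in SINGLE_FILE_NAMES or name.endswith(".index.json"):
            matches.append(name)
    return matches
```

Under that assumption, calling the helper on `checkpoint-10000/unet/` would find the `diffusion_pytorch_model.safetensors.index.json` file, while calling it on `checkpoint-10000/` itself would not see anything that lives only inside the `unet/` subfolder.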

I also wanted to test the checkpoint for inference, but loading it fails as well:

unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
pipe = DiffusionPipeline.from_pretrained(model_path, unet=unet, safety_checker=None)

This fails with the same error.

Reproduction

Inference code:

unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
pipe = DiffusionPipeline.from_pretrained(model_path, unet=unet, safety_checker=None)

Training script:
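For comparison, here is a hedged sketch of one common loading pattern: take the UNet from the checkpoint's `unet/` subfolder and everything else from the base model. It assumes the checkpoint folder holds only the fine-tuned UNet (the listing above shows only `unet/`) and that the installed diffusers version can read sharded UNet weights. The helper names `resolve_unet_dir` and `load_finetuned_pipeline` are hypothetical.

```python
import os


def resolve_unet_dir(checkpoint_dir):
    """Return the unet/ subfolder of a checkpoint, or raise if it looks incomplete."""
    unet_dir = os.path.join(checkpoint_dir, "unet")
    if not os.path.isfile(os.path.join(unet_dir, "config.json")):
        raise FileNotFoundError(f"no config.json under {unet_dir}")
    return unet_dir


def load_finetuned_pipeline(base_model, checkpoint_dir):
    """Load the base pipeline with its UNet replaced by the checkpoint's UNet."""
    # Imported here so resolve_unet_dir can be used without diffusers installed.
    from diffusers import DiffusionPipeline, UNet2DConditionModel

    unet = UNet2DConditionModel.from_pretrained(resolve_unet_dir(checkpoint_dir))
    return DiffusionPipeline.from_pretrained(base_model, unet=unet)
```

With the paths from this report, the call would look like `load_finetuned_pipeline("stabilityai/stable-diffusion-xl-base-1.0", "/mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000")`.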

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/naruto-blip-captions"
export OUTPUT_DIR="/mnt/wangxuekuan/finetune/all/sdxl-exp0"
export RESUME_FROM_CHECKPOINT="/mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000"
export DATASET_NAME="selected_16" #"/mnt/xys/dataset/character/all_in_one_0419/"

accelerate launch train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --train_data_dir=$DATASET_NAME --caption_column="text" \
  --resume_from_checkpoint=$RESUME_FROM_CHECKPOINT \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=1000000 \
  --use_8bit_adam \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=50 \
  --output_dir=$OUTPUT_DIR \
  --push_to_hub

Logs

No response

System Info

Python 3.8, diffusers 0.30, A100-80G

Who can help?

No response

Jun 21 '24 12:06 XuekuanWang

If you pull in the latest changes of the repository in your local fork, you should be able to perform inference with the code snippet you mentioned. If not, please provide the error trace.

Jun 21 '24 13:06 sayakpaul

I encountered the same problem. Is there any solution? Thank you very much.

Jul 18 '24 14:07 LJQbiu

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Sep 14 '24 15:09 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Dec 13 '24 15:12 github-actions[bot]

train_text_to_image_sdxl.py fails to resume from a checkpoint, and the checkpoint cannot be loaded for inference
