train_text_to_image_sdxl.py fails to resume from a checkpoint, and the checkpoint also cannot be loaded for inference
Describe the bug
I am trying to fine-tune an SDXL model but keep running into the same problem.
I cannot resume training from a checkpoint; the error is as follows:
[rank0]: load_checkpoint_in_model(
[rank0]: File "/mnt/wangxuekuan/miniconda3/envs/sdxl/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 1637, in load_checkpoint_in_model
[rank0]: raise ValueError(
[rank0]: ValueError: /mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000 is not a folder containing a .index.json file or a pytorch_model.bin or a model.safetensors file
Here are the contents of the checkpoint's unet folder:
(sdxl) wangxuekuan@ucloud-9:/mnt/wangxuekuan/diffusers/examples/text_to_image$ ls /mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000/unet/
config.json
diffusion_pytorch_model-00001-of-00002.safetensors
diffusion_pytorch_model-00002-of-00002.safetensors
diffusion_pytorch_model.safetensors.index.json
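To see why the directory check fails, here is a simplified, hypothetical re-implementation of the test implied by the ValueError (a folder must directly contain a `.index.json` file, a `pytorch_model.bin`, or a `model.safetensors`). It is not accelerate's actual code, and `accepted_weight_files` / `make_fake_checkpoint` are illustrative names. It rebuilds the `ls` layout above with empty files and shows that the checkpoint root named in the error holds no matching file, while the `unet/` subfolder does.

```python
import os
import tempfile
from typing import List

# Two of the three accepted forms named in the error; the third is any *.index.json file.
ACCEPTED_NAMES = ("pytorch_model.bin", "model.safetensors")


def accepted_weight_files(folder: str) -> List[str]:
    """List files directly inside `folder` that match a form named in the error message."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.endswith(".index.json") or name in ACCEPTED_NAMES
    )


def make_fake_checkpoint(root: str) -> str:
    """Recreate the checkpoint's unet/ layout with empty files and return the unet/ path."""
    unet_dir = os.path.join(root, "unet")
    os.makedirs(unet_dir)
    for name in (
        "config.json",
        "diffusion_pytorch_model-00001-of-00002.safetensors",
        "diffusion_pytorch_model-00002-of-00002.safetensors",
        "diffusion_pytorch_model.safetensors.index.json",
    ):
        open(os.path.join(unet_dir, name), "w").close()
    return unet_dir


with tempfile.TemporaryDirectory() as root:
    unet_dir = make_fake_checkpoint(root)
    print(accepted_weight_files(root))      # prints [] (the root only holds unet/)
    print(accepted_weight_files(unet_dir))  # prints ['diffusion_pytorch_model.safetensors.index.json']
```

Note that the path in the reported error is the checkpoint root (checkpoint-10000), not its unet/ subfolder where the sharded index file actually lives.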
Meanwhile, I also tried to test the checkpoint, and loading it for inference fails as well:
unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
pipe = DiffusionPipeline.from_pretrained(model_path, unet=unet, safety_checker=None)
It raises the same error.
Reproduction
Inference code:
from diffusers import DiffusionPipeline, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
pipe = DiffusionPipeline.from_pretrained(model_path, unet=unet, safety_checker=None)
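A hedged sketch of an alternative inference setup, assuming `model_path` above is the checkpoint-10000 folder: load only the UNet from the checkpoint's `unet/` subfolder and take every other component from the base model. The assumption here is that the checkpoint root written during training holds the `unet/` folder plus training state rather than a complete pipeline (no `model_index.json`), so `DiffusionPipeline.from_pretrained` on that folder would not work on its own. The function name `load_finetuned_sdxl` is hypothetical. On diffusers 0.30 the UNet load itself may still hit the sharded-loading error above; the maintainer's reply recommends updating to the latest code for that part.

```python
def load_finetuned_sdxl(
    checkpoint_dir: str,
    base_model: str = "stabilityai/stable-diffusion-xl-base-1.0",
    vae_model: str = "madebyollin/sdxl-vae-fp16-fix",
):
    """Build an SDXL pipeline whose UNet comes from a training checkpoint folder (sketch)."""
    # Imported inside the function so this sketch can be defined without diffusers installed.
    import torch
    from diffusers import AutoencoderKL, StableDiffusionXLPipeline, UNet2DConditionModel

    # Fine-tuned weights live in checkpoint_dir/unet (sharded safetensors + index.json).
    unet = UNet2DConditionModel.from_pretrained(
        checkpoint_dir, subfolder="unet", torch_dtype=torch.float16
    )
    # The same fp16-safe VAE that the training command used.
    vae = AutoencoderKL.from_pretrained(vae_model, torch_dtype=torch.float16)
    # Text encoders, tokenizers and scheduler come from the base model, since the
    # checkpoint folder is assumed not to be a complete pipeline.
    return StableDiffusionXLPipeline.from_pretrained(
        base_model, unet=unet, vae=vae, torch_dtype=torch.float16
    )
```

Example usage with the paths from this issue: `pipe = load_finetuned_sdxl("/mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000").to("cuda")`, then `image = pipe("a cute Sundar Pichai creature").images[0]`. SDXL pipelines have no safety checker, so `safety_checker=None` is left out.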
Training script:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/naruto-blip-captions"
export OUTPUT_DIR="/mnt/wangxuekuan/finetune/all/sdxl-exp0"
export RESUME_FROM_CHECKPOINT="/mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000"
export DATASET_NAME="selected_16" #"/mnt/xys/dataset/character/all_in_one_0419/"

accelerate launch train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --train_data_dir=$DATASET_NAME --caption_column="text" \
  --resume_from_checkpoint=$RESUME_FROM_CHECKPOINT \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=1000000 \
  --use_8bit_adam \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=50 \
  --output_dir=$OUTPUT_DIR \
  --push_to_hub
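For context on which folder ends up in the error, here is a hypothetical, simplified sketch of how the example script resolves `--resume_from_checkpoint`, under the assumption that a given path is reduced to its basename and re-joined with `--output_dir`, while the special value `"latest"` selects the highest-numbered `checkpoint-*` folder. `resolve_resume_dir` is an illustrative name, not the script's actual code. With the values above it resolves to /mnt/wangxuekuan/finetune/all/sdxl-exp0/checkpoint-10000, the folder named in the ValueError.

```python
import os
import tempfile
from typing import Optional


def resolve_resume_dir(output_dir: str, resume_from_checkpoint: str) -> Optional[str]:
    """Map a --resume_from_checkpoint value to a checkpoint folder (assumed logic)."""
    if resume_from_checkpoint != "latest":
        # Only the folder name is kept; it is looked up inside output_dir.
        name = os.path.basename(resume_from_checkpoint)
    else:
        dirs = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
        # Sort by step number, not as strings.
        dirs.sort(key=lambda d: int(d.split("-")[1]))
        name = dirs[-1] if dirs else None
    return os.path.join(output_dir, name) if name else None


with tempfile.TemporaryDirectory() as out:
    for step in (50, 950, 10000):
        os.makedirs(os.path.join(out, f"checkpoint-{step}"))
    # Numeric sort picks checkpoint-10000; a plain string sort would pick checkpoint-950.
    print(os.path.basename(resolve_resume_dir(out, "latest")))  # prints checkpoint-10000
```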
System Info
Python 3.8, diffusers 0.30, A100 80GB
If you pull in the latest changes of the repository in your local fork, you should be able to perform inference with the code snippet you mentioned. If not, please provide the error trace.
I encountered the same problem. Is there any solution? Thank you very much.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.