Bad FLUX LoRA training result
Hi @kohya-ss, thank you for your detailed and excellent work on FLUX finetuning and LoRA training!! I got bad results when I ran the sample LoRA training script with network_dim=32, 50 high-quality 1024×1024 input images, and max_train_epochs=50 (2500 steps in total).
python==3.10.15
torch==2.4.0
torchmetrics==1.6.0
torchvision==0.19.0
transformers==4.44.0
accelerate==0.33.0
xformers==0.0.23.post1
diffusers==0.25.0
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 --main_process_port 23333 \
  flux_train_network.py \
  --pretrained_model_name_or_path /black-forest-labs/FLUX.1-schnell/flux1-schnell.safetensors \
  --clip_l /SD3/text_encoders/clip_l.safetensors \
  --t5xxl /SD3/text_encoders/t5xxl_fp16.safetensors \
  --ae /black-forest-labs/FLUX.1-schnell/ae.safetensors \
  --cache_latents_to_disk \
  --save_model_as safetensors \
  --sdpa \
  --persistent_data_loader_workers \
  --max_data_loader_n_workers 2 \
  --seed 42 \
  --gradient_checkpointing \
  --mixed_precision bf16 \
  --save_precision bf16 \
  --network_module networks.lora_flux \
  --network_dim 32 \
  --network_train_unet_only \
  --optimizer_type adamw8bit \
  --learning_rate 1e-4 \
  --cache_text_encoder_outputs \
  --cache_text_encoder_outputs_to_disk \
  --highvram \
  --max_train_epochs 50 \
  --save_every_n_epochs 1 \
  --dataset_config flux_image_50.toml \
  --output_dir /flux_unet/log/lora \
  --output_name flux-lora-name \
  --timestep_sampling shift \
  --discrete_flow_shift 3.1582 \
  --model_prediction_type raw \
  --guidance_scale 1.0
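The --dataset_config file follows the standard sd-scripts dataset TOML layout, roughly like this (the image_dir path, caption extension, and repeat count below are placeholders, not my actual flux_image_50.toml):

[general]
enable_bucket = false        # all images are already 1024x1024
caption_extension = ".txt"   # assumes one .txt caption per image

[[datasets]]
resolution = 1024
batch_size = 1               # 50 images x 50 epochs at batch 1 gives the 2500 steps above

  [[datasets.subsets]]
  image_dir = "/path/to/training_images"  # placeholder path
  num_repeats = 1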
I got bad inference results when running:
python3 flux_minimal_inference.py \
  --ckpt black-forest-labs/FLUX.1-schnell/flux1-schnell.safetensors \
  --clip_l /SD3/text_encoders/clip_l.safetensors \
  --t5xxl /SD3/text_encoders/t5xxl_fp16.safetensors \
  --ae /black-forest-labs/FLUX.1-schnell/ae.safetensors \
  --dtype bf16 \
  --prompt "A small cactus with a happy face in the Sahara desert." \
  --out /flux_unet/log/lora \
  --seed 42 \
  --flux_dtype fp8 \
  --offload \
  --lora "/flux_unet/log/lora/flux-lora-name.safetensors;1.0"
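The ";1.0" after the LoRA path is the strength multiplier (quoting the argument also keeps the shell from treating ";" as a command separator). Conceptually, applying a LoRA at strength m adds m * (alpha / rank) * (B @ A) to each targeted weight; a minimal sketch of that math, not the script's actual loading code:

import torch

# Sketch: applying a LoRA delta at strength m to a base weight W.
# A: (rank, in_features) down-projection, B: (out_features, rank) up-projection.
def apply_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, m: float = 1.0) -> torch.Tensor:
    rank = A.shape[0]
    return W + m * (alpha / rank) * (B @ A)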
The comparison between the original FLUX output (upper) and the LoRA-added output (lower):
whereas my training images are very good (like this):
Can you give me some hints about it? Thank you so much!!
Here are a few things to point you in the right direction...
- It's been my experience that schnell is more prone to having issues in training than dev. I don't know if using dev is an option for you, but if it is, you may want to rerun using that.
- 50 images is too many. Try 20-30 with a consistent concept, sometimes fewer. I've done several successful LoRAs with a single image. More pictures may take longer to converge.
- With an LR of only 0.0001, more than 2500 steps may be needed, especially with 50 pictures. Typically I would use an LR of 0.00015 with 20 pictures for 4000 steps; sometimes it would converge at 2000, and sometimes it would need the full run.
- Your sample image is AI-generated and has artifacts. These can be magnified through training. I don't know if more of your training set is AI-generated, but that could be the issue as well.
Check your inference workflow to see if it includes a shift node. Early FLUX workflows did not apply shift, while the training script does.
@sdbds The training script includes --timestep_sampling shift and --discrete_flow_shift 3.1582. Should I delete them, or change --timestep_sampling to another sampling type? I just found that the flux-schnell scheduler has shift 1; should I change --discrete_flow_shift to 1?
If you train schnell, it should not use any shift. Only dev has shift; you should remove the related shift parameters during training.
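To make the shift concrete: with --timestep_sampling shift, the training timestep t in (0, 1) is warped toward the noisier end via t' = s*t / (1 + (s - 1)*t). A minimal sketch of that mapping (my own names, not the script's exact code):

import torch

# Sketch: how a discrete flow shift warps sampled timesteps.
# shift > 1 biases training toward noisier timesteps; shift = 1 is the identity.
def sample_shifted_timesteps(batch_size: int, shift: float = 3.1582) -> torch.Tensor:
    t = torch.sigmoid(torch.randn(batch_size))  # base sample in (0, 1)
    return (t * shift) / (1 + (shift - 1) * t)

Since shift = 1 leaves t unchanged, setting --discrete_flow_shift 1.0 effectively disables the warp.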
@sdbds
I just changed the training script to:
--timestep_sampling uniform \
--discrete_flow_shift 1.0 \
--guidance_scale 0.0 \
However, I still get bad results. Can you explain where the shift is? I think I did not get your idea 「(;´༎ຶД༎ຶ`)」
I am also confused about --discrete_flow_shift. What is its purpose, and how should it be set during training and inference? Can someone explain it? Thank you very much! By the way, setting --guidance_scale 1.0 is a good attempt.
Any updates?