
Flux LoRA training seems not to converge with a big dataset (140 images)

Open terrificdm opened this issue 1 year ago • 26 comments

I use one NVIDIA L40S (48 GB VRAM) to train a LoRA for Flux. Here is my training command:

```bash
./sd-scripts/flux_train_network.py \
  --pretrained_model_name_or_path ./model/flux1-dev.safetensors \
  --clip_l ./model/clip_l.safetensors --t5xxl ./model/t5xxl_fp16.safetensors \
  --ae ./model/ae.safetensors \
  --cache_latents_to_disk --save_model_as safetensors --sdpa \
  --persistent_data_loader_workers --max_data_loader_n_workers 2 \
  --gradient_checkpointing --mixed_precision bf16 --full_bf16 --save_precision bf16 \
  --network_module networks.lora_flux --network_dim 64 --network_alpha 32 \
  --learning_rate 1 --lr_scheduler cosine_with_restarts --lr_scheduler_num_cycles 1 \
  --optimizer_type prodigy --network_train_unet_only \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --highvram \
  --max_train_steps 3000 --save_every_n_steps 500 \
  --dataset_config ./dataset.toml --output_dir ./lora_weight --output_name flux-lora-demo \
  --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
  --loss_type l2 --t5xxl_max_token_length 512 --min_snr_gamma 5 \
  --sample_every_n_steps 500 --sample_prompts ./sample_prompt.toml --sample_sampler euler_a \
  --logging_dir ./logs --log_with all --log_tracker_name flux_lora_komoru \
  --wandb_api_key xxxxxxxxxxxxxxxx
```

My training dataset contains 140 images, with the following dataset.toml configuration:

```toml
[general]
enable_bucket = true
caption_extension = '.txt'
keep_tokens = 0

# DreamBooth caption-based character dataset
[[datasets]]
resolution = 1024
min_bucket_reso = 640
max_bucket_reso = 1536
bucket_reso_steps = 32
batch_size = 4

[[datasets.subsets]]
image_dir = './dataset'
```

The final loss/average is around 0.38, which is much higher than a nearly identical configuration (smaller batch_size and fewer steps) run on a small dataset (12 images, ~0.08 loss/average).

Any advice for my training configurations? Thanks.

PS: I am not sure that min_snr_gamma=5 works well for Flux training, but it seemed to improve convergence slightly.

terrificdm avatar Aug 21 '24 23:08 terrificdm

I have the same problem as you

Yukinoshita-Yukinoe avatar Aug 22 '24 01:08 Yukinoshita-Yukinoe

In my case, it converges even with a data set of 3,000 images. How about using adamw8bit as the optimizer, a learning rate of 1e-3, and network_alpha=1?

kohya-ss avatar Aug 22 '24 03:08 kohya-ss
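(For reference, kohya-ss's suggestion above maps onto the command in the first post as the following flag changes; this is only a sketch of the affected flags, everything else would stay as it was.)

```bash
# kohya-ss's suggested changes, relative to the command in the first post
# (original: --optimizer_type prodigy --learning_rate 1 --network_alpha 32)
--optimizer_type adamw8bit --learning_rate 1e-3 --network_alpha 1
```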

I just changed optimizer_type=adamw8bit, lr=1e-3, and network_alpha=1 as you suggested. The training is still in progress, but it doesn't look promising: only a slight improvement so far...

The blue line is the modified training run.

[Screenshot: loss graph, 2024-08-22 15:43]

terrificdm avatar Aug 22 '24 07:08 terrificdm

Looks like I'm having the same issue (#1464). I'm training with 326 images + 2669 reg images, and my loss has been around 0.435 for all 5 epochs. Are you using reg images by any chance, and if so, how many?

Tophness avatar Aug 22 '24 07:08 Tophness

Do we know if bucketing works with flux yet? That could explain it

Tophness avatar Aug 22 '24 07:08 Tophness

> Looks like I'm having the same issue (#1464). I'm training with 326 images + 2669 reg images, and my loss has been around 0.435 for all 5 epochs. Are you using reg images by any chance, and if so, how many?

No, I didn't use reg images.

terrificdm avatar Aug 22 '24 08:08 terrificdm

> Do we know if bucketing works with flux yet? That could explain it

It worked.

terrificdm avatar Aug 22 '24 08:08 terrificdm

> > Do we know if bucketing works with flux yet? That could explain it
>
> It worked.

I meant that neither of us can reach convergence, and we both use bucketed resolutions.

Tophness avatar Aug 22 '24 08:08 Tophness

> > > Do we know if bucketing works with flux yet? That could explain it
> >
> > It worked.
>
> I meant that neither of us can reach convergence, and we both use bucketed resolutions.

I see. But bucketed resolutions worked fine with the small dataset, and that run converged.

terrificdm avatar Aug 22 '24 08:08 terrificdm

> In my case, it converges even with a data set of 3,000 images. How about using adamw8bit as the optimizer, a learning rate of 1e-3, and network_alpha=1?

The final convergence didn't improve much after I changed the parameters as you suggested. The blue line used adamw8bit, lr=1e-3, and network_alpha=1.

[Screenshot: loss graph, 2024-08-22 22:58]

Even though the convergence was not as good as I expected, the quality of images generated with the trained LoRA was acceptable. @kohya-ss I am curious about your LoRA training results (e.g. loss/average) with the 3,000-image dataset.

terrificdm avatar Aug 22 '24 15:08 terrificdm

> The final convergence didn't improve much after I changed the parameters as you suggested. The blue line used adamw8bit, lr=1e-3, and network_alpha=1.
>
> Even though the convergence was not as good as I expected, the quality of images generated with the trained LoRA was acceptable. @kohya-ss I am curious about your LoRA training results (e.g. loss/average) with the 3,000-image dataset.

I'm mostly astonished that the loss is so low; the lowest I've ever seen is about 0.6.

enoblegas avatar Aug 22 '24 17:08 enoblegas

> Even though the convergence was not as good as I expected, the quality of images generated with the trained LoRA was acceptable. @kohya-ss I am curious about your LoRA training results (e.g. loss/average) with the 3,000-image dataset.

[Image: loss graph]

batch size=2, dim(rank)=4, alpha=1, optimizer adamw8bit, learning rate 5e-4, constant scheduler, with --network_args "loraplus_unet_lr_ratio=4"

Although the loss did not decrease significantly, the model had a tendency to overfit, so I stopped training early (only one epoch).

lora_plus may speed up convergence considerably.

kohya-ss avatar Aug 22 '24 22:08 kohya-ss
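For anyone who wants to try this recipe, here is a sketch of how kohya-ss's settings above could be spliced into the command from the first post (paths are taken from that post; batch size 2 would be set via batch_size = 2 in dataset.toml; treat this as an illustration, not a verified configuration):

```bash
./sd-scripts/flux_train_network.py \
  --pretrained_model_name_or_path ./model/flux1-dev.safetensors \
  --clip_l ./model/clip_l.safetensors --t5xxl ./model/t5xxl_fp16.safetensors \
  --ae ./model/ae.safetensors \
  --dataset_config ./dataset.toml --output_dir ./lora_weight --output_name flux-lora-demo \
  --save_model_as safetensors --sdpa --gradient_checkpointing \
  --mixed_precision bf16 --full_bf16 --save_precision bf16 \
  --cache_latents_to_disk --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
  --network_module networks.lora_flux \
  --network_dim 4 --network_alpha 1 \
  --network_args "loraplus_unet_lr_ratio=4" \
  --network_train_unet_only \
  --optimizer_type adamw8bit --learning_rate 5e-4 --lr_scheduler constant \
  --max_train_steps 3000 --save_every_n_steps 500 \
  --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0
```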

> [Image: loss graph]
>
> batch size=2, dim(rank)=4, alpha=1, optimizer adamw8bit, learning rate 5e-4, constant scheduler, with --network_args "loraplus_unet_lr_ratio=4"
>
> Although the loss did not decrease significantly, the model had a tendency to overfit, so I stopped training early (only one epoch).
>
> lora_plus may speed up convergence considerably.

Got it, thanks a lot. It looks like the final training loss can't be decreased to a low number even when the model has converged.

terrificdm avatar Aug 23 '24 02:08 terrificdm

> > In my case, it converges even with a data set of 3,000 images. How about using adamw8bit as the optimizer, a learning rate of 1e-3, and network_alpha=1?
>
> The final convergence didn't improve much after I changed the parameters as you suggested. The blue line used adamw8bit, lr=1e-3, and network_alpha=1.

It turned out that Flux might need more training resources to converge to a decent loss.

Below is a multi-GPU run (4x L40S) for 3,500 steps with batch_size=4, i.e. 4 GPUs x batch 4 x 3,500 steps for the 140 images ...

[Screenshot: loss graph, 2024-08-25 10:54]

terrificdm avatar Aug 25 '24 03:08 terrificdm

> PS: I am not sure that min_snr_gamma=5 works well for Flux training, but it seemed to improve convergence slightly.

Have you attempted training both with and without min_snr_gamma being specified? If so, did you find that specifying it produced better or worse results?

setothegreat avatar Aug 26 '24 03:08 setothegreat

> > PS: I am not sure that min_snr_gamma=5 works well for Flux training, but it seemed to improve convergence slightly.
>
> Have you attempted training both with and without min_snr_gamma being specified? If so, did you find that specifying it produced better or worse results?

I tried it with and without min_snr_gamma before; it improved convergence a little, but it didn't have as strong an impact as it does in SDXL training.

terrificdm avatar Aug 26 '24 13:08 terrificdm

> I just changed optimizer_type=adamw8bit, lr=1e-3, and network_alpha=1 as you suggested. The training is still in progress, but it doesn't look promising: only a slight improvement so far...
>
> The blue line is the modified training run.
>
> [Screenshot: loss graph, 2024-08-22 15:43]

Can you share which tool is used in the pic to monitor the loss? Thanks.

seasnakes avatar Sep 05 '24 03:09 seasnakes

> I just changed optimizer_type=adamw8bit, lr=1e-3, and network_alpha=1 as you suggested. The training is still in progress, but it doesn't look promising: only a slight improvement so far...
>
> The blue line is the modified training run.

Did you change network_alpha from 32 to 1 while keeping network_dim at 64? That seems to make the weight strength very low (1/64th of what it would be if alpha and dim were equal), and that's one thing I'm confused about: in SD1.5 and SDXL training we use a network_dim of 16, 32, 64 or even larger, but in kohya's parameters for the 3,000-image dataset this value is set to just 4. I would have thought it would need to be a little higher? @kohya-ss

ScilenceForest avatar Sep 05 '24 10:09 ScilenceForest

> I would have thought it would need to be a little higher? @kohya-ss

It depends on the dataset and the quality you want. FLUX.1 has many dimensions and layers, so even a low-rank LoRA is quite large.

From my understanding, changing alpha does not change the strength of the final LoRA. With a small alpha, the LoRA weights are scaled down when applied, but they are trained to larger values to compensate.

kohya-ss avatar Sep 05 '24 12:09 kohya-ss
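To make the alpha/rank scaling concrete, this is the standard LoRA parametrization from the original paper (as far as I understand, sd-scripts applies the same alpha/dim scaling; take this as an illustration rather than the exact implementation):

$$
W' = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k}
$$

With the settings from the first post the applied scale is $\alpha/r = 32/64 = 0.5$; with network_alpha=1 and network_dim=64 it is $1/64$. Since only the product $(\alpha/r) \cdot BA$ enters the model, a smaller alpha can be compensated by larger trained values in $B$ and $A$ (or by an adjusted learning rate), which is the behaviour described above.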

> It depends on the dataset and the quality you want. FLUX.1 has many dimensions and layers, so even a low-rank LoRA is quite large.
>
> From my understanding, changing alpha does not change the strength of the final LoRA. With a small alpha, the LoRA weights are scaled down when applied, but they are trained to larger values to compensate.

Thank you for your explanation. I also want to know what happens when alpha is greater than rank. I found network_dim=2 and alpha=16 in Civitai's default Flux training parameters.

ScilenceForest avatar Sep 05 '24 12:09 ScilenceForest

> Thank you for your explanation. I also want to know what happens when alpha is greater than rank. I found network_dim=2 and alpha=16 in Civitai's default Flux training parameters.

According to the original LoRA paper, adjusting alpha is roughly the same as adjusting the learning rate. Therefore, if alpha=16 and dim=2, a fairly high learning rate would be required, but if we adjust the learning rate properly, LoRA will be able to learn without any problems.

kohya-ss avatar Sep 05 '24 12:09 kohya-ss

> According to the original LoRA paper, adjusting alpha is roughly the same as adjusting the learning rate. Therefore, if alpha=16 and dim=2, a fairly high learning rate would be required, but if we adjust the learning rate properly, LoRA will be able to learn without any problems.

Thank you again for your patient answer.

ScilenceForest avatar Sep 05 '24 13:09 ScilenceForest

> > I just changed optimizer_type=adamw8bit, lr=1e-3, and network_alpha=1 as you suggested. The training is still in progress, but it doesn't look promising: only a slight improvement so far... The blue line is the modified training run.
>
> Can you share which tool is used in the pic to monitor the loss? Thanks.

It's wandb; Kohya's scripts are already integrated with it.

terrificdm avatar Sep 07 '24 11:09 terrificdm
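For completeness, these are the logging flags already present in the command from the first post; --log_with all should send metrics to both TensorBoard and wandb, and the wandb run is what the screenshots in this thread show:

```bash
# logging flags from the command in the first post
--logging_dir ./logs --log_with all --log_tracker_name flux_lora_komoru --wandb_api_key xxxxxxxxxxxxxxxx
```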

I trained 500 images for 5,000 steps and the loss was around 3.1, so in order to lower the loss to 0.08 I then trained this LoRA for another 5,000-8,000 steps. Compared to the LoRAs saved between 1,000 and 5,000 steps, I must say the quality of the 10,000-step LoRA is much better, but the loss is almost the same, going from 3.16 to 3.14.

Is it something about the distilled model, or some structure we don't see?

RaySteve312 avatar Sep 13 '24 17:09 RaySteve312

@terrificdm The base model here is a de-distilled version of Flux, which may learn more effectively than DEV. https://huggingface.co/bdsqlsz/flux1-dev2pro-single

Technical Explanation https://medium.com/@zhiwangshi28/why-flux-lora-so-hard-to-train-and-how-to-overcome-it-a0c70bc59eaf

waomodder avatar Sep 26 '24 14:09 waomodder

> @terrificdm The base model here is a de-distilled version of Flux, which may learn more effectively than DEV. https://huggingface.co/bdsqlsz/flux1-dev2pro-single
>
> Technical Explanation https://medium.com/@zhiwangshi28/why-flux-lora-so-hard-to-train-and-how-to-overcome-it-a0c70bc59eaf

Wow, interesting article, I will check it out. Thanks.

terrificdm avatar Sep 26 '24 23:09 terrificdm

Try training with this one: https://huggingface.co/nyanko7/flux-dev-de-distill

This model is truly de-distilled; it removes the distilled CFG scale. With it I was able to train multiple subjects in one LoRA, 11 in total, with no bleeding between subjects of the same class. The captions are simple, "name man", "name man", "name woman", etc., and the result was perfect. With regular flux-dev it is impossible to train multiple subjects of the same class without bleeding, and regular flux-dev also suffers from catastrophic forgetting because it is a distilled model. flux-dev-de-distill seems to solve all of these issues.

It is supported by kohya, but image sampling during training is not: the sample images look deformed because sampling uses a real CFG scale of 1, while the model works fine at CFG 3.5. My last and definitive test is using regularization images, which were not usable with regular FLUX-DEV because it mixed all the concepts. You can then use the trained LoRA on regular flux-dev, since inference with CFG 1 and distilled CFG is faster.

With respect to the loss graph, this is mine, but the quality is perfect. Maybe 0.3 is a good number for Flux, I don't know.

[Screenshot: loss graph, 2024-10-12 16:57]

dsienra avatar Oct 12 '24 19:10 dsienra

> Try training with this one: https://huggingface.co/nyanko7/flux-dev-de-distill This model is truly de-distilled; it removes the distilled CFG scale.

Isn't that what this one was supposed to be, only trained on 3 million images instead of 150K? https://huggingface.co/bdsqlsz/flux1-dev2pro-single

> It is supported by kohya, but image sampling during training is not: the sample images look deformed because sampling uses a real CFG scale of 1, while the model works fine at CFG 3.5.

Can we get a cfg_scale parameter for sampling then?

Tophness avatar Oct 14 '24 18:10 Tophness

> > Even though the convergence was not as good as I expected, the quality of images generated with the trained LoRA was acceptable. @kohya-ss I am curious about your LoRA training results (e.g. loss/average) with the 3,000-image dataset.
>
> batch size=2, dim(rank)=4, alpha=1, optimizer adamw8bit, learning rate 5e-4, constant scheduler, with --network_args "loraplus_unet_lr_ratio=4"
>
> Although the loss did not decrease significantly, the model had a tendency to overfit, so I stopped training early (only one epoch).
>
> lora_plus may speed up convergence considerably.

Have you tried a larger batch size and a larger LR? I have an 8x A100 80G setup, so I can go to a very large batch size. E.g. if I use an 8*8 batch size, will I still get good results with a large LR? You suggested 1e-3 for bs=4, so I suppose I should use 1e-2 for such a high batch size?

I've trained multiple LoRAs on SDXL and found that the loss never decreases significantly no matter how I change the LR. When I use the LoRAs I trained for inference, the results are also mixed. I'm running a Flux training task right now with a very large dataset, and the loss decreased from about 0.35+ to 0.34 and then never went down much further. I'm not sure whether I'm using the right LR, or whether it's simply bad to use a large batch size for LoRA training.

atodniAr avatar Dec 27 '24 18:12 atodniAr