
Flux Dedistilled / fluxdev2pro support ?

Open Tablaski opened this issue 1 year ago • 40 comments

I'm trying to use the amazing new Dedistilled models with the trainer

If you haven't tried them, they are groundbreaking: https://civitai.com/models/843551 For me, it's the biggest thing in the Flux community since we became able to train LoRAs.

They would allow training with CFG > 1 (guidance > 1), which would probably allow much better caption adherence during training and possibly better prompt adherence later on.

(Although it is not certain that a LoRA trained with CFG > 1 can be used properly with distilled models. But if it can, the results would probably be amazing, which is why we need to try ASAP.)

So far I've just tried replacing flux1-dev.sft with another file in the following parameter:

--pretrained_model_name_or_path "C:\fluxgym\models\unet\flux1-dev.sft"

But I got the error below, which I haven't really investigated yet. I get the same one using fluxdev2pro, which is a fine-tuned de-distilled model meant to enhance training:

File "C:\fluxgym\sd-scripts\flux_train_network.py", line 519, in trainer.train(args) File "C:\fluxgym\sd-scripts\train_network.py", line 354, in train model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\fluxgym\sd-scripts\flux_train_network.py", line 82, in load_target_model model = self.prepare_split_model(model, weight_dtype, accelerator) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\fluxgym\sd-scripts\flux_train_network.py", line 127, in prepare_split_model flux_upper.to(accelerator.device, dtype=target_dtype) File "C:\fluxgym\env\Lib\site-packages\torch\nn\modules\module.py", line 1340, in to return self._apply(convert) ^^^^^^^^^^^^^^^^^^^^ File "C:\fluxgym\env\Lib\site-packages\torch\nn\modules\module.py", line 900, in _apply module._apply(fn) File "C:\fluxgym\env\Lib\site-packages\torch\nn\modules\module.py", line 900, in _apply module._apply(fn) File "C:\fluxgym\env\Lib\site-packages\torch\nn\modules\module.py", line 927, in _apply param_applied = fn(param) ^^^^^^^^^ File "C:\fluxgym\env\Lib\site-packages\torch\nn\modules\module.py", line 1333, in convert raise NotImplementedError( NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device. Traceback (most recent call last): File "", line 198, in _run_module_as_main File "", line 88, in run_code File "C:\fluxgym\env\Scripts\accelerate.exe_main.py", line 7, in File "C:\fluxgym\env\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main args.func(args) File "C:\fluxgym\env\Lib\site-packages\accelerate\commands\launch.py", line 1174, in launch_command simple_launcher(args) File "C:\fluxgym\env\Lib\site-packages\accelerate\commands\launch.py", line 769, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError:

Tablaski avatar Oct 16 '24 09:10 Tablaski

Maybe you should try to pull the latest version? It works okay on my computer with the dedistilled model.

Ice-YY avatar Oct 16 '24 13:10 Ice-YY

Really? That's great. May I ask what differences you've noticed when training with it?

What CFG did you set? Did it work on distilled models with CFG = 1 afterwards?

How were your results using the LoRA, both on distilled and de-distilled?

Tablaski avatar Oct 16 '24 15:10 Tablaski

You could check this discussion: https://huggingface.co/nyanko7/flux-dev-de-distill/discussions/3 Dsienra is conducting some tests on this model and posting feedback.

Ice-YY avatar Oct 17 '24 02:10 Ice-YY

@Ice-YY thank you, I hadn't seen that discussion on nyanko7's Hugging Face page.

This is extremely interesting. Still, have you tried it yourself?

Tablaski avatar Oct 17 '24 09:10 Tablaski

I've been training a LoRA model on a dataset with several distinct art styles, each with its own unique trigger word. When training on the base model, the output doesn't change much regardless of which trigger word I use for the different art styles. However, when I train the LoRA model using the de-distilled base model, it does show some ability to differentiate between the trigger words, although the results are still not ideal when compared to training LoRA on SDXL. For now I'm experimenting with training using a larger guidance scale (like 6.0) to see if I can improve the results.

Ice-YY avatar Oct 17 '24 12:10 Ice-YY

OK, so for the moment you've tried guidance 1 with de-distilled? Did you then generate images with the resulting LoRA back on distilled?

I am very curious to know whether a LoRA trained on de-distilled with guidance >= 4 (I mean not just 1.5 or 2) would work with distilled Flux, meaning it is backward compatible.

Tablaski avatar Oct 17 '24 16:10 Tablaski

OK, so for the moment you've tried guidance 1 with de-distilled? Did you then generate images with the resulting LoRA back on distilled?

Yes. And now I can confirm that a LoRA trained on de-distilled with guidance 6.0 works fine with the distilled Flux.

Ice-YY avatar Oct 18 '24 00:10 Ice-YY

I've had incredible early success with this model in combination with AdEMAMix8bit. In 326 steps it achieved the bulk of what took 40,000 steps to reach on distilled with AdamW8bit. I used a CFG of 1 and LR 1e-4 for training, then the default CFG in Forge for inference.

Tophness avatar Oct 18 '24 01:10 Tophness

This is very good news then. I gather that training with de-distilled + guidance > 1 improves prompt adherence when going back to distilled models, and that the LoRAs can still use distilled guidance as usual. For LoRAs at least (I'm currently asking questions to someone who has just fine-tuned a checkpoint, will update here).

Tablaski avatar Oct 18 '24 08:10 Tablaski

I changed line 152 in library/flux_train_utils.py: scale = prompt_dict.get("scale", 1.0), but the images are still completely distorted. I'm guessing the problem is the model itself. Maybe you'd need to switch to a non-distilled model for inference.

Tophness avatar Oct 20 '24 02:10 Tophness

Just thought to add my experience: on the latest pull of the SD3 branch, I can train on DevDedistilled. It's detected as a Schnell model because of the missing guidance blocks (line 77 in library/flux_utils.py): is_schnell = not ("guidance_in.in_layer.bias" in keys or "time_text_embed.guidance_embedder.linear_1.bias" in keys), and it is initialized as such. Training works, but that's possibly why samples are broken? Forcing it to be detected as a Dev model breaks it, which makes sense. I imagine it might need some special consideration for full support. Unlike Dev2Pro, DevDedistilled fully removes the distilled guidance blocks.
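If it helps anyone check how their own checkpoint will be detected, here is a minimal standalone sketch (not sd-scripts code; the file path is a placeholder) that inspects the state-dict keys the same way the check quoted above does:

```python
# Minimal sketch: inspect a Flux checkpoint's keys to see whether it still
# contains the distilled-guidance blocks that sd-scripts uses to tell Dev
# from Schnell. Standalone illustration, not part of sd-scripts itself.
from safetensors import safe_open

CKPT = r"C:\fluxgym\models\unet\flux1-dev-dedistilled.safetensors"  # placeholder path

with safe_open(CKPT, framework="pt") as f:
    keys = set(f.keys())

has_guidance = (
    "guidance_in.in_layer.bias" in keys
    or "time_text_embed.guidance_embedder.linear_1.bias" in keys
)
# Mirrors the check above: no guidance keys -> treated as Schnell.
print("guidance blocks present:", has_guidance)
print("would be detected as:", "dev" if has_guidance else "schnell")
```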

Sarania avatar Oct 26 '24 14:10 Sarania

Just thought to add my experience: on the latest pull of the SD3 branch, I can train on DevDedistilled. It's detected as a Schnell model because of the missing guidance blocks (line 77 in library/flux_utils.py): is_schnell = not ("guidance_in.in_layer.bias" in keys or "time_text_embed.guidance_embedder.linear_1.bias" in keys), and it is initialized as such. Training works, but that's possibly why samples are broken? Forcing it to be detected as a Dev model breaks it, which makes sense. I imagine it might need some special consideration for full support. Unlike Dev2Pro, DevDedistilled fully removes the distilled guidance blocks.

@kohya-ss I apologize for the interruption, but I am also quite curious about this situation. Could you please help explain what issues might arise from classifying a de-distilled model as 'Schnell'? Are there any plans to support de-distilled models in the future?

leonary avatar Nov 13 '24 04:11 leonary

@kohya-ss I apologize for the interruption, but I am also quite curious about this situation. Could you please help explain what issues might arise from classifying a de-distilled model as 'Schnell'? Are there any plans to support de-distilled models in the future?

Even if the de-distilled model is classified as Schnell, I think the training might be done correctly. Sample generation requires a negative prompt (CFG) for the de-distilled model, but that is not yet supported; that's why I think sample generation breaks. I'd like to add CFG to the sample generation as soon as I have time.

kohya-ss avatar Nov 21 '24 12:11 kohya-ss

Are there any best practices for training with de-distilled models? Because this is all one big jungle, tbh. It would be awesome to have a collection of best practices / experiments / knowledge sharing...

For example, some questions I have:

  1. There are three de-distilled models that I'm aware of:
  • https://huggingface.co/nyanko7/flux-dev-de-distill
  • https://huggingface.co/ashen0209/Flux-Dev2Pro
  • https://huggingface.co/InstantX/flux-dev-de-distill-diffusers --> are these all roughly equivalent?
  2. When training with a de-distilled model, should you still train with "--guidance_scale = 1.0"?
  3. What if you want to train a LoRA using de-distilled, but then run that LoRA on top of FLUX dev/schnell?
  4. What about LoRA ranks? I currently see 4 being the default? Any reason to go higher?
  5. There is also some interesting work on using better CLIP-L models: https://github.com/zer0int/ComfyUI-Long-CLIP which seems worth testing for training also
  6. Currently, for single object/face LoRAs I'm getting the best results when training without any captions at all, but I am using a fairly detailed trigger word that visually describes the thing I'm trying to learn. There's a lot of confusing information on this topic: https://civitai.com/articles/7203/captions-vs-no-captions-a-deep-dive-into-effects-on-flux-lora-training
  7. Right now I'm testing out masked training, which I remember made a huge difference in my SD15 / SDXL trainers (https://github.com/edenartlab/sd-lora-trainer)

aiXander avatar Dec 17 '24 23:12 aiXander

Are there any best practices for training with de-distilled models? Because this is all one big jungle, tbh. It would be awesome to have a collection of best practices / experiments / knowledge sharing...

For example, some questions I have:

  1. There are three de-distilled models that I'm aware of:
  • https://huggingface.co/nyanko7/flux-dev-de-distill
  • https://huggingface.co/ashen0209/Flux-Dev2Pro
  • https://huggingface.co/InstantX/flux-dev-de-distill-diffusers --> are these all roughly equivalent?
  2. When training with a de-distilled model, should you still train with "--guidance_scale = 1.0"?
  3. What if you want to train a LoRA using de-distilled, but then run that LoRA on top of FLUX dev/schnell?
  4. What about LoRA ranks? I currently see 4 being the default? Any reason to go higher?
  5. There is also some interesting work on using better CLIP-L models: https://github.com/zer0int/ComfyUI-Long-CLIP which seems worth testing for training also
  6. Currently, for single object/face LoRAs I'm getting the best results when training without any captions at all, but I am using a fairly detailed trigger word that visually describes the thing I'm trying to learn. There's a lot of confusing information on this topic: https://civitai.com/articles/7203/captions-vs-no-captions-a-deep-dive-into-effects-on-flux-lora-training
  7. Right now I'm testing out masked training, which I remember made a huge difference in my SD15 / SDXL trainers (https://github.com/edenartlab/sd-lora-trainer)
  1. https://huggingface.co/InstantX/flux-dev-de-distill-diffusers is just the diffusers version of https://huggingface.co/nyanko7/flux-dev-de-distill - it's a fully de-distilled model, with the distilled guidance blocks removed. It can be used for both training and inference, and there are also finetunes of it. For inference, it uses classifier-free guidance (CFG) instead of distilled guidance and thus takes about twice as long per step. Quality is slightly higher than base Flux Dev and prompt adherence with CFG is tighter; some LoRAs work better too. Edit: Oh, and de-distilled natively supports negative prompts since it uses CFG, but using adaptive guidance and perpendicular negative you can get negatives with distilled Flux too.
  • https://huggingface.co/ashen0209/Flux-Dev2Pro is only meant for training. It still contains distilled guidance blocks, but it has been trained to perform better with guidance_scale = 1 and thus be suitable as a training scaffold (i.e. you would train your LoRA on Dev2Pro with the intent of using it on regular Dev).

  • So in summary, Dev2Pro is just a retrained Flux Dev meant to perform better specifically for use during training, while dev-dedistilled is a true attempt to recreate Flux Pro both for training and inference.

  2. All the tests that I've seen, and my own, have done so. Edit: Outdated, see below.

  3. That's quite often the idea!

  4. Definitely! I usually use between 16 and 32 depending on what I'm training. Generally, as the complexity of what you're training goes up and the number of concepts you want the LoRA to contain goes up, the rank needs to go up too. There are no hard and fast rules, but I've seen 4-8 recommended for style/anime stuff and simpler stuff like faces, and 16-32 for full character / multiple outfits / complex real-world stuff. I've had my very best results at 24 and 32, but I don't tend to train super simple stuff. Late edit: I checked a LOT of other people's LoRAs that I use with an inspector and they are largely rank 2-16, with most of them 4. Also a LOT of them have alpha > dim; I have several with dim 4, alpha 16???

  5. In my experience Long CLIP slightly improves the quality of images created during inference (better shadows/details). I tried a while back to test it for training, but at that time kohya's scripts didn't like it. Dunno if that's changed.

  6. "There's a lot of confusing information on this topic" - Indeed. Regardless of what anyone says, if something works well for you then go for it. I've gotten some of my best results with adafactor despite trying much newer stuff like adamwschedulefree... weird as that is. I train with detailed prose-format captions personally, with custom highly unique tags to mark the main concepts I wanna be able to reproduce. I've not tried without captions so I can't speak to that.

  7. I've been meaning to try out masking as well, please report how it goes!

Sarania avatar Dec 19 '24 15:12 Sarania

Thank you Sarania! I can now confirm that adding masks significantly improves results by maintaining better promptability. I'm currently using very basic prompt-based CLIPSeg masks from my SDXL trainer, which work great! https://github.com/edenartlab/sd-lora-trainer/blob/main/trainer/preprocess.py#L168 (I simply added these masks as an alpha channel in the training images and then activated --alpha_mask)
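For anyone who wants to reproduce this without pulling in the SDXL trainer, here is a minimal sketch of the idea, assuming the public CIDAS CLIPSeg checkpoint; the file names and the prompt are placeholders rather than the exact code linked above:

```python
# Minimal sketch: build a prompt-based CLIPSeg mask and store it in the image's
# alpha channel so it can be picked up with --alpha_mask. File names and the
# prompt are placeholders; this shows the general idea, not the linked code.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("train_0001.png").convert("RGB")   # placeholder file name
inputs = processor(text=["a person's face"], images=[image], return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # low-resolution segmentation logits

probs = torch.sigmoid(logits).squeeze().numpy()                      # 0..1 mask
mask = Image.fromarray((probs * 255).astype(np.uint8))               # to 8-bit "L" image
mask = mask.resize(image.size, Image.BILINEAR)                       # back to full resolution

rgba = image.copy()
rgba.putalpha(mask)                                                  # mask becomes the alpha channel
rgba.save("train_0001_masked.png")
```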

I've gotten great results training LoRAs on top of flux-dev-de-distill; I'm going to test training on top of Flux-Dev2Pro now to see if I notice any differences.

One thing I'm still struggling with is training multiple concepts / faces / ... into a single LoRA, is that something you've successfully done? E.g. for faces I get a ton of bleeding (like most people report). I believe one main reason is that the kohya flux trainer doesn't support textual inversion, which really is the way you'd want to do multiple concepts. I started a thread about Flux TI here with some good conversation, but I don't think anyone has cracked it yet...

In fact, there seems to be a lot of great experimentation going on in diffusers with TI that I haven't tried yet, anyone have experience with these training scripts? https://huggingface.co/blog/linoyts/new-advanced-flux-dreambooth-lora

aiXander avatar Dec 19 '24 16:12 aiXander

Thank you Sarania! I can now confirm that adding masks significantly improves results by maintaining better promptability. I'm currently using very basic prompt-based CLIPSeg masks from my SDXL trainer, which work great! https://github.com/edenartlab/sd-lora-trainer/blob/main/trainer/preprocess.py#L168 (I simply added these masks as an alpha channel in the training images and then activated --alpha_mask)

I've gotten great results training LoRAs on top of flux-dev-de-distill; I'm going to test training on top of Flux-Dev2Pro now to see if I notice any differences.

One thing I'm still struggling with is training multiple concepts / faces / ... into a single LoRA, is that something you've successfully done? E.g. for faces I get a ton of bleeding (like most people report). I believe one main reason is that the kohya flux trainer doesn't support textual inversion, which really is the way you'd want to do multiple concepts. I started a thread about Flux TI here with some good conversation, but I don't think anyone has cracked it yet...

In fact, there seems to be a lot of great experimentation going on in diffusers with TI that I haven't tried yet, anyone have experience with these training scripts? https://huggingface.co/blog/linoyts/new-advanced-flux-dreambooth-lora

In the one model I've trained that contained multiple faces, they absolutely did bleed together. This was acceptable, as it was meant to be a training of different facial imperfections and just more interesting faces than base Flux. I've never tried to train multiple /distinct/ faces; I imagine that would be very tricky. Maybe reg images would help?

I will definitely give masking a shot then, I keep hearing really good things. I tried a few trains on Dev2Pro but that was so early on in my Flux experience that I don't think the data is relevant (it didn't go well, but nothing did at first XD). I haven't tried training on de-distilled yet except for an aborted run, but for inference it's definitely useful. Sometimes a prompt I can't get to work in Dev will work in de-distilled and vice versa, and LoRAs are often more effective (sometimes TOO effective) with CFG (you can also use low CFG with distilled Flux, especially if adaptive, and this has given me my best results! E.g. distilled guidance ~3.8, CFG ~1.4 for 50-100% of steps). I've recently just been running tests at night, changing up various parameters and seeing what works best. Most recently I've been experimenting with Huber loss but my results are mixed - better in some areas, worse in others.

It's still the wild west for Flux I think but it's super exciting to see all the developments that are happening!

Sarania avatar Dec 19 '24 21:12 Sarania

Thanks for your previous answers guys, adding my 2 cents here:

I now always train my LoRAs using nyanko7's de-distilled model and --guidance_scale = 4.0. No issues to report when using them with distilled models. I haven't trained with a higher guidance_scale like 6.0, but I'm sure it would work. I just don't see the point of using a de-distilled training model with guidance_scale = 1.

I have never used Fluxdev2pro. I thought it was a de-distilled or partly de-distilled model; if you can elaborate further, I would appreciate it. I know for sure it's a fine-tune on 3 million pictures, but I don't know what it is or isn't good at. The article that presented it showed fantasy examples, so it must be better for that, but what about faces, NSFW, etc.?

I have experience with masked training; I have written articles on Civitai about using SegmentAnything to generate masks: https://civitai.com/articles/9000/segmentanything-create-masks-for-lora-training-or-img2img https://civitai.com/articles/8974/training-non-face-altering-loras-full-workflow

Next, I want to test SAM2 to see if the automated mask generation is even better: https://github.com/neverbiasu/ComfyUI-SAM2

I love masked training.

What is annoying though (correct me if I'm wrong, as I haven't updated my Kohya scripts to the latest version) is that you have no way of telling whether the masks have been properly loaded by Kohya unless you add loggers yourself. Which I did; refer to my articles for some help with that.
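Until the scripts log this themselves, a quick standalone check like the sketch below (the folder path is a placeholder) at least confirms that each training image actually carries an alpha-channel mask and shows how much of the image it covers:

```python
# Quick sanity check, separate from Kohya's scripts: confirm every training
# image has an alpha-channel mask and report its coverage. Folder is a placeholder.
from pathlib import Path

import numpy as np
from PIL import Image

folder = Path(r"C:\fluxgym\datasets\my_lora")  # placeholder dataset folder

for path in sorted(folder.glob("*.png")):
    img = Image.open(path)
    if "A" not in img.getbands():
        print(f"{path.name}: NO alpha mask (mode={img.mode})")
        continue
    alpha = np.array(img.getchannel("A"), dtype=np.float32) / 255.0
    print(f"{path.name}: mask covers {alpha.mean():.1%} of the image")
```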

Please note, however, that I recently found out that even when masking all the faces in a dataset while training it, the LoRA will perform better ALONE. As soon as you bring other LoRAs in, its performance decreases (a lot if faces were trained in the concurrent LoRA, not much if faces were masked in the concurrent LoRA). I'm now doing a second img2img pass to inpaint the face again with the face LoRA on its own, and then the results are excellent.

Multi-concept training should be avoided as much as possible, I think; de-distilled with guidance 4 helps with it, but it's more effective to train several LoRAs if possible.

Now the technique I really want to try is fine-tuning the whole Flux model on the training dataset, then extracting the LoRA from the resulting checkpoint. According to people who have done it, it doesn't take longer and is much more effective.

Tablaski avatar Dec 20 '24 09:12 Tablaski

It makes sense that dev-dedistilled could benefit from guidance_scale > 1; I wondered about that when I mentioned above that I'd never seen anyone try it. Have you compared directly to guidance_scale = 1 and seen improvement?

Ashen's Flux Dev2Pro was specifically retrained to perform better with guidance_scale = 1. This was before dev-dedistilled existed, and it was recognized that we're forced to train Flux Dev with guidance_scale = 1 because of its distilled nature, while at the same time Flux Dev performs REALLY badly with guidance_scale = 1. It was thought that by training it to perform better in this specific scenario, we could get better results by training a LoRA on Dev2Pro for use on regular Dev, just like how you might train a LoRA on base SDXL to use on Turbo or one of the distilled SDXL models. It's only meant to be a training scaffold; it performs worse than normal Flux in all scenarios EXCEPT guidance_scale = 1. I've heard good things about it, but when I tried it I was just starting out and nothing really worked yet, so I can't speak to more than the theoretical.

Dev-dedistilled came later and goes even further by modifying the model's architecture and actually removing the distilled guidance entirely from the model as well as retraining it to work with CFG. It's a more complete attempt at recreating Flux Pro from Flux Dev. Personally I think Dev-dedistilled supersedes Dev2Pro for most purposes since it's a more full conversion and is also highly useful for inference as well. However Dev2Pro does remain closer to the original Flux model, so there might be some benefit to that.

I think you can do multi-concept okay if the concepts are related enough. I've had luck with that a couple of different times. Highly unique tokens help a lot here, and there will be some bleeding of concepts, but if it's a dataset that can tolerate that, it's definitely doable. However, training multiple distinct faces in the same LoRA would be a nightmare, I agree. Batch size is definitely a factor with multi-concept stuff; increasing it seems to make it harder for the model to latch on to individual concepts, which makes intuitive sense.

Sarania avatar Dec 20 '24 15:12 Sarania

Woo, so for the last few days I've been running test after test after test on my best dataset, changing only a single variable each time. After investigating others' LoRAs above and discovering none were larger than rank 16, I second-guessed myself and did two back-to-back runs with the only difference being one was R16 and one R24. It's a complicated dataset and the R24 was clearly superior, so yeah, increasing rank above 16 can definitely still help. In fact, with Adafactor + batch_size = 2, the R24 was my best LoRA to that point (I also tested adamw8bit, constant, and cosine with restarts).

But the big win was last night - I trained against Dev-dedistilled using guidance_scale = 4 with everything else the same. Wow! What a huge win! For inference, both on regular Flux and de-distilled, this produced far and away the best output I've gotten, by a huge margin. Prompt adherence is MUCH tighter, quality is much higher, just... it's so much better. HIGHLY RECOMMEND.

Edit: I wanted to test whether guidance_scale was actually affecting things, since the model is being detected as Schnell and such. Investigating the code left me uncertain, so I trained two small rank 8 LoRAs on one of my smaller datasets with the only difference being one had guidance_scale = 1.0 and the other 4.0. Guidance is /absolutely/ working and it does its job - the output looks much more like the training input. But where my above dataset was superb quality, this one is subpar. I actually like the guidance_scale 1 LoRA images more; the higher guidance pushed more of the quality issues from the dataset into the model. I'd say, by rough estimate, 2.0 might be a good choice in this case. So it's worth tweaking, but yeah, absolutely use it!

Sarania avatar Dec 22 '24 13:12 Sarania

I personally use guidance 4 and rank 32 systematically. Maybe guidance 6 could be useful for datasets involving several concepts.

Tablaski avatar Dec 23 '24 11:12 Tablaski

Hi! I wonder if the Flux Trainer workflow for Flux in ComfyUI will work for de-distilled model LoRA training? Should I use a new node with real CFG guidance to train it correctly? When I use the workflow with cfg=4 it looks fine, but idk if a correct workflow would make it better.

seedclaimer avatar Dec 23 '24 15:12 seedclaimer

And thank you for the excellent information! It's been very helpful for me in learning about Flux LoRA training.

Based on your sharing and what I've learned elsewhere, I've summarized the following training tips. Please correct me if I'm wrong:

  1. For dev and dev2pro models, it's recommended to use training cfg 1. (I don't know the precise meaning of cfg=1 during training. Does it completely disable something, or does it simply reduce the influence of the guidance?)
  2. For de-distilled models, it's recommended to use training cfg values from 1 to 6. Higher cfg values make the model more responsive to prompts, and increase its similarity to the dataset. However, this also places higher demands on dataset quality.
  3. When using training cfg=1, captions are useless.
  4. For generating, use cfg around 3.5, no matter what cfg the lora was trained on.
  5. For generating, use dev and de-distilled. Dev2pro is generally not advised.
  6. De-distilled requires true CFG guidance, rather than the distilled guidance used for Flux dev, to generate correctly. (Is this the same when training?) And it needs more steps to infer (60 is good).
  7. Alpha masking works well.

seedclaimer avatar Dec 23 '24 15:12 seedclaimer

Let me preface this by saying there are two types of guidance at play here. CFG, i.e. classifier-free guidance, is the same kind we all know and love from SDXL, SD, etc. Flux Pro (the paywalled version) also uses CFG. Flux Dev was distilled to use a more lightweight "FluxGuidance" to make it more approachable on consumer hardware. When you are inferencing with a normal Flux model you are generally using FluxGuidance, and the image will break without it because of the distillation. You can also use CFG in LOW amounts, but this will double the time per step (in exchange for slightly improved quality and a negative prompt). Flux Dev de-distilled tries to return the model to the Flux Pro state. It uses CFG only, takes longer per step and needs more steps, but has higher quality, prompt adherence, etc., and the image will break without CFG. Lastly, FluxGuidance is an inference-only type of guidance.

So in short, for inference:

  • Normal Flux - requires FluxGuidance; CFG optional and must be kept very low (<1.4 without adaptive); lighter weight and great quality.
  • De-distilled Flux - requires CFG; FluxGuidance not available; heavier weight, but better quality and tighter control.
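To make the distinction concrete, here is a toy sketch of the two call patterns; the flux callable is a stand-in, not the real model interface, so treat it purely as an illustration of why de-distilled costs roughly two forward passes per step:

```python
# Toy illustration of the two guidance mechanisms described above.
# `flux` is a stand-in callable, not the real model interface.
import torch

def step_distilled(flux, latents, t, cond, flux_guidance=3.5):
    # FluxGuidance: the guidance strength is just another input to ONE forward pass.
    return flux(latents, t, cond, guidance=flux_guidance)

def step_dedistilled(flux, latents, t, cond, uncond, cfg=4.0):
    # True CFG: TWO forward passes (positive and negative prompt), then combine.
    pred_cond = flux(latents, t, cond)
    pred_uncond = flux(latents, t, uncond)
    return pred_uncond + cfg * (pred_cond - pred_uncond)

# Dummy "model" so the sketch runs on its own:
dummy = lambda x, t, c, guidance=None: 0.9 * x + 0.1 * c
x = torch.randn(1, 16, 64, 64)
print(step_distilled(dummy, x, 0, torch.ones_like(x)).shape)
print(step_dedistilled(dummy, x, 0, torch.ones_like(x), torch.zeros_like(x)).shape)
```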

For training, when we set --guidance_scale on the command line, from my experiments and investigation that's applying CFG, and that's why it generally needs to be 1 for normal Flux but is useful for de-distilled. However, the sample prompt generation is currently using FluxGuidance, so that's why samples are broken for de-distilled. I looked into implementing CFG for sample generation and it's a bit beyond me, even though I am experienced with Python and AI.

So in short, for training:

  • Normal Flux - use --guidance_scale = 1.0; samples will work correctly.
  • De-distilled Flux - use --guidance_scale > 1.0, with the caveats you mentioned taken into account. As of Dec 23, 2024, samples will be broken, so might as well not use them.

  1. Not so sure about that personally. I always train with captions, but I've not compared to without; I'd be interested to see a comparison.

  2. Not really. Normal Flux can use FluxGuidance from ~2.0-7.0 at least, and CFG of up to say 1.4 without issues (use perpendicular negative if you want a negative prompt with normal Flux, or CFG will break your images!). Flux de-distilled can use CFG 2.0-7.0 as well, but no FluxGuidance. More of either guidance = stricter prompt adherence but lower quality and diversity. As far as how the LoRA's training method affects this... I find LoRAs trained on de-distilled with CFG benefit from lower FluxGuidance when using normal Flux models, but otherwise the same rules apply. Both types of guidance should be tweaked depending on how your output is going. I tend to stick around 3.5-4.5 for either type!

I hope this is all helpful!

Sarania avatar Dec 23 '24 20:12 Sarania

@Sarania Thanks a lot! Your explanation of the differences between distilled guidance and CFG, the actual use of CFG in the Flux trainer, and Flux's tolerance for CFG < 1.4 has really clarified many of my doubts.

I have tried training under CFG = 1 with and without captions, but on two slightly different datasets. I didn't observe significant differences. However, there are some articles discussing this (https://civitai.com/articles/7203/captions-vs-no-captions-a-deep-dive-into-effects-on-flux-lora-training).

I might go take a look at the differences between CFG and Flux guidance in terms of inference/training code and network structure, and try out some more training. If I find anything, I'll report back here.

seedclaimer avatar Dec 24 '24 13:12 seedclaimer

After a lot more testing, I'll share some of my current results / opinions:

  • getting better results training on top of flux-dev-de-distill-diffusers vs Fluxdev2pro (inference works really well with both Flux-Dev and Flux-Schnell!)
  • training with cfg=4.0 + proper captions (through GPT4-vision) did not improve my inference results; I'm still getting the best results by training with cfg=1.0 and just a single, descriptive trigger text as the prompt for all training imgs
  • using --lr_scheduler cosine_with_restarts with --lr_scheduler_num_cycles 3 improved results for me
  • Tested --min_snr_gamma=5.0 which may have improved results a tiny bit (from this paper)
  • also adding --noise_offset 0.1 and --ip_noise_gamma 0.1 seemed to help a bit
  • adding proper masks def helps a lot. Currently using CLIPSeg masks which are very "leaky"; need to experiment with e.g. Florence2 + SAM2 for near pixel-perfect masks.
  • currently doing about 2000-4000 training steps at lr=0.5e-4 with lora_rank = 8 (my datasets are roughly 6 - 30 imgs, so very small!)

I'm def interested in experimenting more with cfg>1.0 training + proper captions cause it feels like that's the way forward here. Also, I'm curious if anyone has tried integrating FLUX Redux into a training pipeline? Essentially, Redux encodes an input image into the LLM token space, creating tokens that represent the image. So theoretically, it should be possible to do that for many training images, somehow average the token embeddings, and use that instead of a manually chosen "trigger_text" as a basis for training a LoRA on top.

There are some tricky issues here though:

  • it's not obvious how to average multiple Redux encodings since they are all multi-token sequences, and just averaging them in token space is prob not going to work (you really want to average their outputs at the LLM conditioning level that goes into the denoiser)
  • The Redux token sequence is very long, creating a very strong conditioning. This may hinder generalization of the model (eg making a cartoon version of a person etc). But there has been some great experimentation with modulating the redux conditioning.

I'm also very interested to play with OminiControl training, as it may eventually replace LoRAs altogether.

aiXander avatar Dec 27 '24 11:12 aiXander

Thanks for your insights.

I'm surprised you didn't find substantial inference improvements when training with CFG > 1. I will keep on training with de-distilled CFG = 4 because I found some, but then I didn't test A vs B rigorously. Anyway, my point is it's definitely not worse, so every little bit helps.

I'm starting attempts to fine-tune then extract, which is supposed to make a big difference, but I currently don't have the right settings for my 16 GB VRAM card. I run out of memory.

@aiXander What benefit do you think adding Florence2 on top of SAM2 would bring?

I don't really get your part about FLUX Redux; it seems like you want to use it kind of like a VLM, but the output wouldn't be a text prompt, right?

I've never used Redux nor OmniControl yet.

Tablaski avatar Dec 27 '24 12:12 Tablaski

As far as general training settings go (and other things I've learned), my experiments have shown the following, at least for me and my datasets (a sketch pulling these flags into one command follows the list):

  • "--min_snr_gamma 5.0" - possibly slightly helpful, definitely not worse
  • "--multires_noise_discount 0.3 --multires_noise_iterations 6" - BIG improvement in quality of smaller/fine details
  • "--noise_offset 0.1" - seems to be a small help
  • "--timestep_sampling sigmoid --discrete_flow_shift 3.0" produces overall better results than "--timestep_sampling shift --discrete_flow_shift 3.1582"
  • Increasing batch size definitely helps with quality and generalization. If low on VRAM, block swap can help! I've been using a batch size of 2 and that's definitely gotten me better results than a batch size of 1. You might have to boost your LR a little as batch size increases, depending on the optimizer.
  • Speaking of LR, it varies a lot. Small simple datasets I can get away with as high as 5e-4 for as little as 500 steps. Large, complex datasets benefit from as low as 2e-5 for many thousands of steps. There's a lot of interplay between other hyperparameters and LR.
  • At least when training on Dev-dedistilled(which is the only way I train now), I've been using network_alpha==network_dim and that seems to be better than alpha at half dim but I haven't tested rigorously.
  • Training the T5XXL produced worse results than identical training without it. Maybe with very explicit, highly verbose captions or edge cases it could be useful
  • Block Swap is AMAZING. I have a 4070 TI SUPER with 16GB VRAM which is just BARELY enough normally if I stick to lower memory optimizers and lighter settings. But with block swap I can go much further, using more advanced optimizers, training the freaking T5XXL if I want. The performance hit is there but at least on my system, it's not that bad.
  • Dataset quality is paramount
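To make the flags above concrete, here is a hedged sketch of how they might be pulled into one flux_train_network.py invocation. The paths, dataset config, rank and learning rate are placeholders, other required arguments (text encoders, VAE, optimizer, etc.) are omitted, and whether each flag is accepted depends on your sd-scripts branch/version (see the --min_snr_gamma question at the end of this thread), so treat it as a starting point only:

```python
# Hypothetical launcher pulling together the flags discussed above.
# Paths are placeholders and required Flux arguments (--clip_l, --t5xxl, --ae,
# optimizer settings, output paths) are omitted for brevity.
import subprocess

cmd = [
    "accelerate", "launch", "flux_train_network.py",
    "--pretrained_model_name_or_path", r"C:\fluxgym\models\unet\flux1-dev-dedistilled.safetensors",
    "--dataset_config", r"C:\fluxgym\dataset.toml",
    "--network_module", "networks.lora_flux",
    "--network_dim", "24", "--network_alpha", "24",      # alpha == dim, as discussed above
    "--guidance_scale", "4.0",                           # > 1 only makes sense on de-distilled
    "--timestep_sampling", "sigmoid", "--discrete_flow_shift", "3.0",
    "--multires_noise_discount", "0.3", "--multires_noise_iterations", "6",
    "--noise_offset", "0.1",
    "--min_snr_gamma", "5.0",                            # may not be accepted by flux_train_network.py
    "--train_batch_size", "2",
    "--learning_rate", "1e-4",
    "--alpha_mask",                                      # if your images carry masks in the alpha channel
]
subprocess.run(cmd, check=True)
```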

I haven't been running a lot of Flux tests recently because I got distracted by Hunyuan video and generative video in general XD

Sarania avatar Dec 27 '24 17:12 Sarania

Thanks for your insights.

I'm surprised you didn't find substantial inference improvements when training with CFG > 1. I will keep on training with de-distilled CFG = 4 because I found some, but then I didn't test A vs B rigorously. Anyway, my point is it's definitely not worse, so every little bit helps.

I'm starting attempts to fine-tune then extract, which is supposed to make a big difference, but I currently don't have the right settings for my 16 GB VRAM card. I run out of memory.

@aiXander What benefit do you think adding Florence2 on top of SAM2 would bring?

I don't really get your part about FLUX Redux; it seems like you want to use it kind of like a VLM, but the output wouldn't be a text prompt, right?

I've never used Redux nor OmniControl yet.

Florence2 can generate bounding boxes based on a prompt and SAM2 is SOTA for box-based segmentation. So combining these two is the best way to generate pixel-perfect masks based on a textual description of the face/object/... you're training on.
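In case it helps, here is a rough sketch of that Florence2 -> SAM2 chain, assuming the APIs published on the Florence-2 model card and in the sam2 package at the time of writing; the model IDs, grounding prompt and file names are assumptions to adapt, not a tested pipeline:

```python
# Rough sketch of the Florence-2 -> SAM2 chain described above: get a bounding
# box for a text prompt from Florence-2, then a pixel mask for that box from
# SAM2, and store the mask as the image's alpha channel for --alpha_mask.
# Model IDs, prompt and file names are assumptions; APIs may drift.
import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from sam2.sam2_image_predictor import SAM2ImagePredictor

image = Image.open("train_0001.png").convert("RGB")        # placeholder file
target = "a woman's face"                                  # what you want masked

# 1) Florence-2: text prompt -> bounding box(es)
fl_processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
fl_model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
task = "<CAPTION_TO_PHRASE_GROUNDING>"
inputs = fl_processor(text=task + target, images=image, return_tensors="pt")
with torch.no_grad():
    ids = fl_model.generate(input_ids=inputs["input_ids"],
                            pixel_values=inputs["pixel_values"],
                            max_new_tokens=1024, num_beams=3)
text = fl_processor.batch_decode(ids, skip_special_tokens=False)[0]
parsed = fl_processor.post_process_generation(text, task=task,
                                              image_size=(image.width, image.height))
box = np.array(parsed[task]["bboxes"][0])                  # first detected box, [x0, y0, x1, y1]

# 2) SAM2: bounding box -> pixel mask
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(box=box, multimask_output=False)

# 3) Mask into the alpha channel, ready for --alpha_mask training
alpha = Image.fromarray((masks[0] * 255).astype(np.uint8))
rgba = image.copy()
rgba.putalpha(alpha)
rgba.save("train_0001_masked.png")
```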

Redux would be a potential way to do some kind of instant, training-free textual inversion, where you end up with token embeddings that already capture most of the visual look of what you're training on. But because the tokens are in LLM space, they are potentially more generalizable / promptable. This would prob be non-trivial to integrate into kohya though.

aiXander avatar Dec 27 '24 19:12 aiXander

@Sarania is it possible to use the --min_snr_gamma 5.0 option? flux_train_network.py does not support that option for me.

SlZeroth avatar Feb 20 '25 10:02 SlZeroth