
Flux textual_inversion

Open aiXander opened this issue 1 year ago • 18 comments

Is anybody already working on a script for Flux TI, or does anyone want to start working on one (I'm down to jump in!)?

Some thoughts:

  • The code from my SDXL LoRA trainer here has good templates to start porting functionality from
  • The tricky part will be getting T5 into memory, since it's so huge...
  • Worst case, you'd have to do a TI phase first (no gradients through the transformer) and then a LoRA phase after that (see the sketch below)
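
A minimal sketch of what that "TI phase" could look like for the CLIP-L encoder, assuming a standard transformers setup (the model name, NUM_VECTORS, and placeholder token names are illustrative, not an existing sd-scripts entry point):

```python
# Sketch only: add placeholder tokens and train just their embedding rows,
# with everything else frozen. All names here are illustrative.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

NUM_VECTORS = 4  # how many embedding vectors the new concept gets
placeholder_tokens = [f"<concept_{i}>" for i in range(NUM_VECTORS)]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokenizer.add_tokens(placeholder_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_ids = tokenizer.convert_tokens_to_ids(placeholder_tokens)

# Freeze the whole encoder, then make only the input embedding matrix trainable.
# The rows of pre-existing tokens must also be kept fixed during training,
# e.g. by restoring them after each optimizer step.
text_encoder.requires_grad_(False)
embedding_layer = text_encoder.get_input_embeddings()
embedding_layer.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([embedding_layer.weight], lr=5e-4)
```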

Another research project I've been thinking about:

  • Try to train a model that can one-shot estimate the token embedding of a 'thing', 'face' or 'character' by just looking at example images of it. Kind of like IP-Adapter, except mapping directly into token space. This would enable actually prompting with that thing.
  • There could still be a LoRA sitting on top to improve fine details, but a reasonable token embedding to start from might help a great deal (a rough sketch of such a module follows below).
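
A very rough sketch of what such an estimator could look like (nothing here exists in sd-scripts; the module name, dimensions, and feature source are made up for illustration):

```python
# Hypothetical "image -> token embedding" estimator: maps image features
# (e.g. from a frozen CLIP vision tower) to pseudo-token embeddings that live
# in the text encoder's embedding space.
import torch
import torch.nn as nn

class TokenEmbeddingEstimator(nn.Module):
    def __init__(self, image_feat_dim=1024, token_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        self.proj = nn.Sequential(
            nn.Linear(image_feat_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, token_dim * num_tokens),
        )

    def forward(self, image_features):           # (B, image_feat_dim)
        out = self.proj(image_features)           # (B, token_dim * num_tokens)
        return out.view(-1, self.num_tokens, self.token_dim)

# The predicted embeddings would be spliced into the prompt's token sequence
# at the placeholder positions before running the text encoders / DiT.
```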

aiXander avatar Sep 10 '24 15:09 aiXander

This may work: https://github.com/Littleor/textual-inversion-script, but it requires a lot of VRAM. I'm in the process of writing a TI training script that fits in 24GB of VRAM.

Littleor avatar Sep 13 '24 05:09 Littleor

This would be a really nice feature to have. Currently, using multiple LoRAs in the same image causes some visible output degradation (a screen-door effect) unless we reduce the strength of the LoRAs.

However, some concepts for which we currently have to use LoRAs could easily be trained into embeddings, which shouldn't cause such degradation since the DiT weights wouldn't be touched. A well-trained embedding could potentially work even better than in SD models, given the powerful T5 encoder.

@kohya-ss Are there any plans to support textual inversion in the codebase? Yesterday I started an attempt to adapt the SDXL version, but it's been a bit of a struggle since Flux requires some significant changes. I would gladly provide assistance on this effort where I can.

recris avatar Sep 13 '24 15:09 recris

I have just implemented FLUX.1 dev Textual Inversion within 20GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.

Littleor avatar Sep 13 '24 16:09 Littleor

@kohya-ss Are there any plans to support textual inversion in the codebase? Yesterday I started an attempt to adapt the SDXL version, but it's been a bit of a struggle since Flux requires some significant changes. I would gladly provide assistance on this effort where I can.

I think TI training doesn't work on SDXL at the moment because I did a big refactoring on the sd3 branch. I will make TI training on SDXL work first, so please wait for a while.

kohya-ss avatar Sep 14 '24 04:09 kohya-ss

I have just implemented FLUX.1 dev Textual Inversion within 20GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.

@kohya-ss I have now implemented Textual Inversion training for the FLUX.1 dev model on a 24GB VRAM GPU, which may provide some help for implementing it in this codebase: https://github.com/Littleor/textual-inversion-script?tab=readme-ov-file#low-vram-usage.

What's more, TI training for SDXL already works in this code: https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_sdxl.py. I hope this can be helpful.
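
For reference, the key trick in that diffusers example (paraphrased here, not copied verbatim) is to let the whole embedding matrix receive gradients but copy back the original rows for every token except the placeholders after each optimizer step. The names `tokenizer`, `text_encoder`, and `new_token_ids` below are assumed to come from a setup like the sketch earlier in the thread:

```python
# Paraphrased row-restoration trick from the linked textual_inversion_sdxl
# example; variable names are assumptions, not the script's exact code.
import torch

orig_embeds = text_encoder.get_input_embeddings().weight.data.clone()

keep_fixed = torch.ones(len(tokenizer), dtype=torch.bool)
keep_fixed[new_token_ids] = False  # only the placeholder rows may change

# ... inside the training loop, right after optimizer.step() ...
with torch.no_grad():
    weight = text_encoder.get_input_embeddings().weight
    weight[keep_fixed] = orig_embeds[keep_fixed]
```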

Littleor avatar Sep 14 '24 15:09 Littleor

I have just implemented FLUX.1 dev Textual Inversion within 20GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.

Awesome, is this doing TI on both T5 and CLIP?

aiXander avatar Sep 14 '24 21:09 aiXander

I have just implemented FLUX.1 dev Textual Inversion within 20GB of VRAM. After completing training and testing, I will open-source the code, which may be helpful.

Awesome, is this doing TI on both T5 and CLIP?

Yes, this trains on both T5 and CLIP.
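
For context, a hedged sketch of what training placeholder embeddings on both encoders could look like for FLUX.1 dev using the diffusers model layout (the subfolder names follow the public FLUX.1-dev repository; everything else is illustrative and not Littleor's actual code):

```python
# Sketch: add the same placeholder token to both the CLIP and T5 tokenizers,
# resize both input embedding matrices, and mark only those matrices trainable.
from transformers import (CLIPTokenizer, CLIPTextModel,
                          T5TokenizerFast, T5EncoderModel)

placeholder_tokens = ["<concept>"]
repo = "black-forest-labs/FLUX.1-dev"

clip_tok = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
clip_enc = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
t5_tok = T5TokenizerFast.from_pretrained(repo, subfolder="tokenizer_2")
t5_enc = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder_2")

trainable_weights = []
for tok, enc in [(clip_tok, clip_enc), (t5_tok, t5_enc)]:
    tok.add_tokens(placeholder_tokens)
    enc.resize_token_embeddings(len(tok))
    enc.requires_grad_(False)
    emb = enc.get_input_embeddings()
    emb.weight.requires_grad_(True)   # non-placeholder rows restored each step
    trainable_weights.append(emb.weight)
```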

Littleor avatar Sep 14 '24 22:09 Littleor

Good thought! We might also discuss zero-shot, few-shot, and many-shot approaches for T2I.

IMO, these approaches can be characterized by how many training shots they need and by their granularity.

IP-Adapter is the zero-shot approach, with coarse granularity. In particular, face-ID IP-Adapter needs much higher granularity because faces are a very fine-grained task; plain face-ID is worse than InstantID or PuLID, which deliver higher granularity.

LoRA is the few-shot approach, with finer granularity. The higher the rank, the better the details, but worse generalization can become a problem if the number of shots is too low. Following the scaling law, you need a high-rank LoRA, many-shot images, and curated captions all at the same time to lift your results.

Full fine-tuning is the strongest method in theory but not in practice. It provides the highest-granularity control, but few people possess the high-quality, large-scale data (and GPUs) it requires. We find plenty of burned/washed-out fine-tuned T2I checkpoints on CIVITAI.

ControlNet (as opposed to IP-Adapter) is somewhat of a side path. It fixes the input modality, so the scaling requirement is cut down and we can train with much less effort than the base model needs. But it actually has coarser granularity than LoRA or fine-tuning; you just don't want to build a "hand-fixer ControlNet", etc.

IMO we may just want to sort out the conditioning signals (text, vision embeddings, vision modality maps, hybrid information in unknown custom data...) to build a better and more universal paradigm for the next generation of T2I. The FLUX ecosystem is a good start, but we will certainly be calling for low-shot methods in the future. IMO LoRA is the best of these options: people need to steer T2I tasks in their private domain without training a new model.

I believe FLUX is more suitable for textual inversion than any other model for subject-ID tasks. The capability emerges.

sipie800 avatar Sep 18 '24 02:09 sipie800

@Littleor I've tested your TI training repo but haven't had any success (it won't learn my concept at all). Is it possible there are bugs left in the implementation, or did it work on your end?

aiXander avatar Sep 26 '24 12:09 aiXander

@Littleor

Hi, thanks for implementing this! But I cannot find the code now, could you please share it? 😭 Also, may I ask how long your textual inversion training takes? Mine got stuck and is extremely slow.

LilyDaytoy avatar Oct 18 '24 20:10 LilyDaytoy

This may work: https://github.com/Littleor/textual-inversion-script, but it requires a lot of VRAM. I'm in the process of writing a TI training script that fits in 24GB of VRAM.

Hi, thanks for implementing this! But I cannot find the code now, could you please share it? 😭

DeepSleepCode avatar Dec 02 '24 09:12 DeepSleepCode

This may work: https://github.com/Littleor/textual-inversion-script, but it requires a lot of VRAM. I'm in the process of writing a TI training script that fits in 24GB of VRAM.

Hi, thanks for implementing this! But I cannot find the code now, could you please share it? 😭

Hi, this repository has been hidden because I found that it did not meet my expectations. It may be that directly performing Textual Inversion on FLUX is simply not as effective as on SD. I am trying to train more things simultaneously to make Textual Inversion effective, and that works.

Littleor avatar Dec 04 '24 09:12 Littleor

@Littleor I've tested your TI training repo but haven't had any success (it won't learn my concept at all). Is it possible there are bugs left in the implementation, or did it work on your end?

I'm sorry for the late reply. I also suspected there might be implementation issues, but after checking several times, I couldn't find any problems. So I tried training some FLUX layers and it worked. At the moment, my tentative understanding is that Textual Inversion is not as effective on FLUX as it is on SD.

Littleor avatar Dec 04 '24 09:12 Littleor

@Littleor I've tested your TI training repo but haven't had any success (it won't learn my concept at all). Is it possible there are bugs left in the implementation, or did it work on your end?

I'm sorry for the late reply. I also suspected there might be implementation issues, but after checking several times, I couldn't find any problems. So I tried training some FLUX layers and it worked. At the moment, my tentative understanding is that Textual Inversion is not as effective on FLUX as it is on SD.

Hi, thanks for your reply. "So I tried training some FLUX layers and it worked." Do you mean that you trained some double-stream layers in FLUX and froze the T5 and CLIP word embeddings, and that FLUX can learn new concepts this way?

DeepSleepCode avatar Dec 04 '24 09:12 DeepSleepCode

@Littleor I've tested your TI training repo but haven't had any success (it won't learn my concept at all). Is it possible there are bugs left in the implementation, or did it work on your end?

I'm sorry for the late reply. I also suspected there might be implementation issues, but after checking several times, I couldn't find any problems. So I tried training some FLUX layers and it worked. At the moment, my tentative understanding is that Textual Inversion is not as effective on FLUX as it is on SD.

Hi, thanks for your reply. "So I tried training some FLUX layers and it worked." Do you mean that you trained some double-stream layers in FLUX and froze the T5 and CLIP word embeddings, and that FLUX can learn new concepts this way?

Hi, thanks for your reply. I just train some attention layers and freeze the T5 and CLIP word embeddings (only the new token's word embedding is trainable, as in textual inversion). I found that FLUX can learn some new concepts this way, and that it does not work if only the new token embedding is trained.
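
A rough sketch of that recipe (not Littleor's code): it assumes a diffusers-style FLUX transformer bound to `transformer` and `new_token_embedding_weights` collected from a TI setup like the earlier sketches; the `.attn.` filter and block indices are arbitrary illustrative choices.

```python
# Illustrative sketch: keep the TI token embeddings trainable and additionally
# unfreeze the attention weights of a few double-stream blocks.
import torch

params_to_train = list(new_token_embedding_weights)  # e.g. [clip_emb.weight, t5_emb.weight]

transformer.requires_grad_(False)
for name, param in transformer.named_parameters():
    # unfreeze attention weights in the first few double-stream blocks only
    in_early_double_block = any(name.startswith(f"transformer_blocks.{i}.") for i in range(4))
    if in_early_double_block and ".attn." in name:
        param.requires_grad_(True)
        params_to_train.append(param)

optimizer = torch.optim.AdamW(params_to_train, lr=1e-5)
```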

Littleor avatar Dec 06 '24 05:12 Littleor

I just stumbled on this blog post, which also mentions pure Textual Inversion training. I tried it three times, but I couldn't get it to learn my concept :/

Sebastian-Zok avatar Dec 13 '24 17:12 Sebastian-Zok

@Littleor Hi, I am now exploring an effective FLUX ID-consistency solution. Do you think textual inversion is a good way to achieve strong consistency? Traditional LoRA training needs heavy computation.

lieding avatar Feb 26 '25 02:02 lieding

@Littleor Hi, I am now exploring an effective FLUX ID-consistency solution. Do you think textual inversion is a good way to achieve strong consistency? Traditional LoRA training needs heavy computation.

Textual Inversion may be a good method, as it can learn many new concepts, such as colors, styles, and objects. However, to my knowledge, directly using Textual Inversion on FLUX may not work, as this issue has shown. I haven't been focusing on this recently, so I'm sorry that I can't confirm.

Littleor avatar Feb 26 '25 07:02 Littleor

Maybe this version for DeepFloyd's T5 could be used for FLUX as well? https://github.com/oss-roettger/T5-Textual-Inversion

Manni1000 avatar May 05 '25 19:05 Manni1000