Prompt-Tuning for text-to-image diffusion models
Hi, I have been looking for a simple example/script that shows how to use the prompt-tuning technique in the PEFT library to fine-tune the text encoder of a stable diffusion model, but I could not find any. Could you please point me to one if it already exists? If there is no implementation, I would appreciate any help or available resources for fine-tuning the text encoder, with or without the PEFT library. Thanks!
I'm not an expert on stable diffusion, but AFAIK, no special handling is required to fine-tune the text encoder when it comes to PEFT itself. You can use LoRA or any of the other techniques that are implemented. If the text encoder uses OpenCLIP or a similar architecture, you'll have to work off the branch from #1324, as those models use the MultiheadAttention layer and the PR to support it has not been merged yet.
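For illustration, applying LoRA to the text encoder could look roughly like this (an untested sketch; the target_modules names assume the pipeline's text encoder is the Hugging Face CLIPTextModel, whose attention projections are named q_proj, k_proj, v_proj, and out_proj):

from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# LoRA on the attention projections of the CLIP text encoder; the module
# names are an assumption based on the transformers CLIP implementation
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
pipe.text_encoder = get_peft_model(pipe.text_encoder, lora_config)
pipe.text_encoder.print_trainable_parameters()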
When it comes to details like datasets and objectives for training the text encoder, this is outside my domain and you'll have a better chance looking at how other folks fine-tune it.
Thank you for sharing your knowledge/experience and that branch. So, from what I understood, it seems LoRA is the only PEFT method currently implemented for the CLIP text encoder in stable diffusion (not prompt-tuning, P-tuning, or prefix-tuning). Please correct me if I'm wrong.
Also, do you have any plans, now or in the future, to support the other three PEFT methods (prompt-tuning, P-tuning, and prefix-tuning) for stable diffusion, or equivalently for its CLIP text encoder, since those methods operate on the text input prompt, similar to what has already been implemented for LLMs?
You should be able to use prompt learning techniques such as prompt-tuning too. What I meant is that methods not based on prompt learning, such as LoRA, IA³, BOFT, etc., cannot be used on MultiheadAttention layers; for LoRA, there is a branch that implements it.
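If you're unsure whether your text encoder is affected, a quick check like this (just a sketch) shows whether it contains nn.MultiheadAttention modules:

import torch.nn as nn
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# True if non-prompt-learning methods would hit the MultiheadAttention
# limitation and need the branch from #1324
print(any(isinstance(m, nn.MultiheadAttention) for m in pipe.text_encoder.modules()))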
Yes, I got it, and I have also read discussion #761; thank you for the great contribution there.
However, what matters to me now is this: I want to know whether there is any prompt-tuning implementation (or even a simple example) that shows how to do prompt-tuning in the peft library to fine-tune the text encoder of the stable diffusion pipeline (e.g. CompVis/stable-diffusion-v1-4). More specifically, I know that the peft implementation offers several TaskTypes for fine-tuning different categories of language models. But, honestly, as I am not an expert in language models, I am not sure which of those TaskTypes the text encoder in the diffusion pipeline (which is CLIP) falls under. Since I could not find any resources or implementation for this, I am looking for a simple example of fine-tuning the CLIP text encoder of the diffusion pipeline using the existing implementation of the peft library. I hope my question is clearer now.
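For context, this is how I listed the available task types, and none of them obviously matches a CLIP text encoder to me:

from peft import TaskType

# Print every task type offered by the peft library
for task_type in TaskType:
    print(task_type)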
Unfortunately, I have also never come across a use case for fine-tuning the LM of an SD model, and there are no examples I'm aware of. Note that TaskType is optional, so even if your task is not listed, you can still use PEFT. If you have an existing example of fine-tuning the LM part of an SD model and want to adapt it to PEFT, that would be very helpful; I could check it and see what needs to be changed.
Okay, it is very helpful to know that there is at least a way of doing such a use case with PEFT. Thank you for putting time into this, and I will be waiting for your update. Please also let me know if you need more info or anything else from my side.
Hi @BenjaminBossan, I wanted to kindly know if there is any update on this issue. Thanks!
As an update: I need to do something similar to the following simple script using the PEFT library, but I'm not sure what task type to use and what other changes need to be made:
from peft import get_peft_model, PromptTuningConfig, TaskType
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

peft_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # probably wrong for CLIP; this is what I'm unsure about
    num_virtual_tokens=5,
    token_dim=768,  # hidden size of the CLIP text encoder
    num_layers=12,
    tokenizer_name_or_path="CompVis/stable-diffusion-v1-4",
)

# Apply PEFT to the text encoder of the pipeline
model = get_peft_model(pipe.text_encoder, peft_config)
Note that you don't need to indicate a task type if the task you're training on does not correspond to any of the existing ones. As for the rest, it really depends on the data you have, the training objective, etc. If you have an existing example that you want to modify to use PEFT, you can share it here and I can check.
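To make that concrete, here is a sketch of your script (untested). If a task type does turn out to be necessary for the prompt learning wrapper, TaskType.FEATURE_EXTRACTION might be the closest fit for a plain encoder, but I haven't verified this for CLIP:

from diffusers import StableDiffusionPipeline
from peft import PromptTuningConfig, TaskType, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# FEATURE_EXTRACTION is an assumption: the CLIP text encoder only produces
# hidden states, so it looks like the closest existing task type
peft_config = PromptTuningConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    num_virtual_tokens=5,
    token_dim=768,  # hidden size of the SD v1 CLIP text encoder
)
model = get_peft_model(pipe.text_encoder, peft_config)
model.print_trainable_parameters()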
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.