Represent learnt concept in textual inversion with more than one token
Describe the bug
As we discussed in #266,
the original textual inversion implementation supports using more than one vector to represent the learnt concept. In the current implementation, if we just extend the learned vocab and the CLIP token embedding, only one vector is used for the concept.
What could be the best way to support this? cc @patil-suraj
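One way to support multiple vectors would be to append `num_vectors` new rows to the text encoder's token embedding matrix instead of one. Below is a minimal, hypothetical sketch (not the diffusers API): `add_concept_vectors` and its arguments are made-up names, and the initialisation-by-cycling scheme is an assumption, not what any repo does.

```python
# Hypothetical sketch: grow an embedding matrix by `num_vectors` rows for a
# multi-vector concept, each new row copied from an initializer token's row
# (cycling through the initializers if there are fewer than num_vectors).
import torch


def add_concept_vectors(embedding: torch.nn.Embedding, initializer_ids, num_vectors):
    old_weight = embedding.weight.data
    vocab_size, dim = old_weight.shape
    # Build the new rows by cycling over the initializer token ids.
    new_rows = torch.stack(
        [old_weight[initializer_ids[i % len(initializer_ids)]].clone()
         for i in range(num_vectors)]
    )
    # Allocate a larger embedding and copy the old weights in front.
    new_embedding = torch.nn.Embedding(vocab_size + num_vectors, dim)
    new_embedding.weight.data[:vocab_size] = old_weight
    new_embedding.weight.data[vocab_size:] = new_rows
    return new_embedding
```

During training, only the last `num_vectors` rows would be optimized, with the rest of the matrix kept frozen.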
Reproduction
No response
Logs
No response
System Info
diffusers v0.2.4
cc @patil-suraj - are we planning on supporting this soon or rather not?
Some people [1] have claimed that using more than one token to represent the learned concept helps improve the inversion performance
(Ctrl-F num_vectors). But I guess implementing that would require a lot of changes to the current pipeline.
reddit #1
+1
Hi! I implemented this for my project in a bit of a hacky way, but I can make a PR for it if you want! Basically, I transformed the placeholder token into n other placeholder tokens combined.
I made a PR showing what I mean [here](https://github.com/huggingface/diffusers/pull/661)! It might make the prompts a bit verbose, so I'm thinking it might be better to inherit from the tokenizer.
Hey @isamu-isozaki, thanks a lot! How are the results with your approach? This is different from the way it's done in the official repo, no?
@patil-suraj Hi! Yup, the approach is a bit different in how the tokens are assigned, but you can get the same result by using a single initializer token. In my experiments, having more tokens does produce better results. For example, I was just doing fast experiments with lr 5e-4 and batch size 1, and for the default one-token example I got something like

as an end result but with say 12 tokens or so, with the same parameters, I got

Here are my wandb runs. I was trying to generate my roommate's dog using 6 photos.
I'll double-check the original implementation again, but in this one you can have each placeholder token start from a different initial value. So you can describe your subject not just with one token but with a whole sentence. For example, for the 12-token run, I started with the sentence "large white dog and light brownish dog".
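Mapping a descriptive sentence onto the placeholder sub-tokens could look like the sketch below. This is an illustrative guess at the scheme described above, not code from the PR; `initializer_words` is a made-up helper, and cycling the sentence when it is shorter than the number of vectors is an assumption.

```python
# Hypothetical: pick one initializer word per placeholder sub-token from a
# descriptive sentence, cycling through the words if the sentence has fewer
# words than there are vectors.
def initializer_words(sentence: str, num_vectors: int) -> list[str]:
    words = sentence.split()
    return [words[i % len(words)] for i in range(num_vectors)]
```

Each sub-token's embedding row would then be initialised from the CLIP embedding of its assigned word.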
The main problem with this implementation is that it makes the prompt quite verbose. For example, for the picture below, the prompt was
"A picture of <frida>_0 <frida>_1 <frida>_2 <frida>_3 <frida>_4 <frida>_5 <frida>_6 <frida>_7 <frida>_8 <frida>_9 <frida>_10 <frida>_11"
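The verbose prompt above could be generated from the user-facing placeholder automatically. Here is a small sketch of that expansion step; `expand_placeholder` is a hypothetical helper name, and the `{placeholder}_{i}` naming simply mirrors the `<frida>_0 ... <frida>_11` pattern in the comment above.

```python
# Hypothetical: replace a single user-facing placeholder in a prompt with the
# n numbered sub-tokens that were actually trained.
def expand_placeholder(prompt: str, placeholder: str, num_vectors: int) -> str:
    expanded = " ".join(f"{placeholder}_{i}" for i in range(num_vectors))
    return prompt.replace(placeholder, expanded)
```

With this, the user writes "A picture of <frida>" and the pipeline expands it to the full multi-token prompt before tokenization.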
Cool, thanks for the explanation!
I'm also thinking about how to support the original implementation in diffusers. Will have something working soon!
@patil-suraj Sounds good! Let me know if you need any help.