Represent learnt concept in textual inversion with more than one token
Describe the bug
As we discussed in #266,
the original textual inversion implementation supports using more than one vector to represent the learnt concept. In the current implementation, if we just extend the learned vocab and the CLIP token embedding, only one vector is used for the concept.
What could be the best way to support this? cc @patil-suraj
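One way to support multiple vectors would be to append `num_vectors` new rows to the text encoder's token embedding matrix instead of one. Below is a minimal, hypothetical sketch (not the diffusers API): `add_concept_vectors` and its arguments are made-up names, and the initialisation-by-cycling scheme is an assumption, not what any repo does.

```python
# Hypothetical sketch: grow an embedding matrix by `num_vectors` rows for a
# multi-vector concept, each new row copied from an initializer token's row
# (cycling through the initializers if there are fewer than num_vectors).
import torch


def add_concept_vectors(embedding: torch.nn.Embedding, initializer_ids, num_vectors):
    old_weight = embedding.weight.data
    vocab_size, dim = old_weight.shape
    # Build the new rows by cycling over the initializer token ids.
    new_rows = torch.stack(
        [old_weight[initializer_ids[i % len(initializer_ids)]].clone()
         for i in range(num_vectors)]
    )
    # Allocate a larger embedding and copy the old weights in front.
    new_embedding = torch.nn.Embedding(vocab_size + num_vectors, dim)
    new_embedding.weight.data[:vocab_size] = old_weight
    new_embedding.weight.data[vocab_size:] = new_rows
    return new_embedding
```

During training, only the last `num_vectors` rows would be optimized, with the rest of the matrix kept frozen.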
Reproduction
No response
Logs
No response
System Info
diffusers v0.2.4
cc @patil-suraj - are we planning on supporting this soon or rather not?
Some people [1] have claimed that using more than one token to represent the learned concept helps improve the inversion performance
(Ctrl-F num_vectors). But I guess implementing that would require a lot of changes to the current pipeline.
reddit #1
+1
Hi! I implemented this for my project in a bit of a hacky way, but I can make a PR for it if you want! Basically, I transformed the placeholder token into n other placeholder tokens combined.
I made a PR showing what I mean [here](https://github.com/huggingface/diffusers/pull/661)! It might make the prompts a bit verbose, so I'm thinking it might be better to inherit from the tokenizer.
Hey @isamu-isozaki, thanks a lot! How are the results with your approach? This is different from the way it's done in the official repo, no?
@patil-suraj Hi! Yup, the approach is a bit different in how the tokens are assigned, but you can get the same result by using a single initializer token. In my experiments, having more tokens does produce better results. For example, I was just doing fast experiments with lr 5e-4 and batch size 1, and for the default one-token example I got something like

as an end result but with say 12 tokens or so, with the same parameters, I got

Here are my wandb runs. I was trying to generate my roommate's dog using 6 photos.
I'll double-check the original implementation again, but in this one you can have each placeholder token start from a different initial value. So you can describe your subject not just with one token but with a whole sentence. For example, for the 12-token run, I started with the sentence "large white dog and light brownish dog".
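Mapping a descriptive sentence onto the placeholder sub-tokens could look like the sketch below. This is an illustrative guess at the scheme described above, not code from the PR; `initializer_words` is a made-up helper, and cycling the sentence when it is shorter than the number of vectors is an assumption.

```python
# Hypothetical: pick one initializer word per placeholder sub-token from a
# descriptive sentence, cycling through the words if the sentence has fewer
# words than there are vectors.
def initializer_words(sentence: str, num_vectors: int) -> list[str]:
    words = sentence.split()
    return [words[i % len(words)] for i in range(num_vectors)]
```

Each sub-token's embedding row would then be initialised from the CLIP embedding of its assigned word.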
The main problem with this implementation is that it makes the prompt quite verbose. For example, for the picture below, the prompt was
"A picture of <frida>_0 <frida>_1 <frida>_2 <frida>_3 <frida>_4 <frida>_5 <frida>_6 <frida>_7 <frida>_8 <frida>_9 <frida>_10 <frida>_11"
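The verbose prompt above could be generated from the user-facing placeholder automatically. Here is a small sketch of that expansion step; `expand_placeholder` is a hypothetical helper name, and the `{placeholder}_{i}` naming simply mirrors the `<frida>_0 ... <frida>_11` pattern in the comment above.

```python
# Hypothetical: replace a single user-facing placeholder in a prompt with the
# n numbered sub-tokens that were actually trained.
def expand_placeholder(prompt: str, placeholder: str, num_vectors: int) -> str:
    expanded = " ".join(f"{placeholder}_{i}" for i in range(num_vectors))
    return prompt.replace(placeholder, expanded)
```

With this, the user writes "A picture of <frida>" and the pipeline expands it to the full multi-token prompt before tokenization.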
Cool, thanks for the explanation!
I'm also thinking about how to support the original implementation in diffusers. Will have something working soon!
@patil-suraj Sounds good! Let me know if you need any help.