
(Textual Inversion) Initialise Vector For New Token From Multiple Existing Tokens

Open rsomani95 opened this issue 3 years ago • 4 comments

I'd like to propose an idea analogous to https://github.com/huggingface/diffusers/issues/369.

The current fine-tuning script for textual inversion initialises the new placeholder_token's embedding with an existing initializer_token (and enforces that the initializer token is exactly one token).

https://github.com/huggingface/diffusers/blob/84b9df57a7c78e1cd9c132d286341451a4e6a80b/examples/textual_inversion/textual_inversion.py#L409-L411
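
For reference, the current initialisation logic looks roughly like this (paraphrased from the script, so it may not match the linked revision line for line):

# Paraphrased sketch of examples/textual_inversion/textual_inversion.py;
# the exact code in the linked revision may differ.
token_ids = tokenizer.encode(args.initializer_token, add_special_tokens=False)
if len(token_ids) > 1:
    raise ValueError("The initializer token must be a single token.")
initializer_token_id = token_ids[0]

placeholder_token_id = tokenizer.convert_tokens_to_ids(args.placeholder_token)
text_encoder.resize_token_embeddings(len(tokenizer))

# The new placeholder embedding is copied from the single initializer token.
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]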

I was curious if we could initialise a new token from multiple existing ones. Let me give an example use case. Say I'm trying to add the concept of a "low camera angle". The existing model does have some semblance of this concept, but it's far from concrete, and its existing knowledge is not captured by any single token in isolation.


My first thought was to get the embeddings of each token from

tokenizer.encode("low camera angle", add_special_tokens=False)

and average them, but that doesn't quite smell right. As I understand it, it's the text_encoder that's responsible for modelling relationships between sequences of words. I wonder what the best strategy might be to initialise a new token from multiple existing ones.

Thanks!

cc @patil-suraj @isamu-isozaki

rsomani95 avatar Sep 29 '22 12:09 rsomani95

Also, note that tokenizer.encode only converts the text into integer token ids; it does not give the embeddings.

To get the embeddings, we need to look the tokens up in the token embedding layer and then average them.

tokens = tokenizer.encode("low camera angle", add_special_tokens=False, return_tensors="pt")
embeddings = text_encoder.get_input_embeddings()(tokens)

This will give the embeddings of the tokens, which you can then average.
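
Putting that together, a minimal sketch of the full flow might look like this (the placeholder token name and variable names are illustrative, and tokenizer / text_encoder are assumed to be the CLIP tokenizer and text encoder from the pipeline being fine-tuned):

import torch

# Add the new placeholder token and make room for its embedding.
placeholder_token = "<low-angle>"  # illustrative name
tokenizer.add_tokens(placeholder_token)
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)
text_encoder.resize_token_embeddings(len(tokenizer))

# Look up the initializer tokens' embeddings and average them.
tokens = tokenizer.encode("low camera angle", add_special_tokens=False, return_tensors="pt")
token_embeds = text_encoder.get_input_embeddings().weight

with torch.no_grad():
    # Initialise the new placeholder embedding with the mean of the
    # initializer tokens' embeddings.
    token_embeds[placeholder_token_id] = token_embeds[tokens[0]].mean(dim=0)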

patil-suraj avatar Sep 29 '22 13:09 patil-suraj

@rsomani95 Hi! Yup, that's interesting. One way I tackled this problem in /pull/661 was to use as many placeholder tokens as there are tokens in the initializer (or more), so that if you put in "white dog", one placeholder token gets assigned to "white" and another to "dog".
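
Roughly, the idea looks like this (a sketch only; the names are illustrative and this isn't the exact code from the PR):

# Sketch of the multi-placeholder approach; not the exact code from the PR.
initializer = "low camera angle"
init_ids = tokenizer.encode(initializer, add_special_tokens=False)

# One new placeholder token per initializer token, e.g. <cam-0> <cam-1> <cam-2>.
placeholder_tokens = [f"<cam-{i}>" for i in range(len(init_ids))]
tokenizer.add_tokens(placeholder_tokens)
placeholder_ids = tokenizer.convert_tokens_to_ids(placeholder_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

token_embeds = text_encoder.get_input_embeddings().weight.data
for p_id, i_id in zip(placeholder_ids, init_ids):
    # Each placeholder starts from the corresponding initializer token's embedding.
    token_embeds[p_id] = token_embeds[i_id]

# Prompts then use all placeholders in sequence, e.g.
# "a photo taken from a <cam-0> <cam-1> <cam-2>".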

I haven't tested the averaging strategy, but I'm not sure it would do well. In the previous example, "white" and "dog" most likely have very different embeddings, so averaging them might land on some completely unrelated concept like "car", or on nothing meaningful at all.

isamu-isozaki avatar Sep 29 '22 13:09 isamu-isozaki

@patil-suraj Thanks for the clarification. Didn't know about the Discord; will join and continue the conversation there.

@isamu-isozaki I saw your PR, and I think that's a great idea. Will try it out and share any interesting findings (or lack thereof). Re. averaging embeddings: I think your concern makes sense; maybe a more straightforward path is to just add more tokens and train for longer.

rsomani95 avatar Sep 29 '22 15:09 rsomani95

@rsomani95 Sounds good! Let me know if you find anything interesting

isamu-isozaki avatar Sep 29 '22 16:09 isamu-isozaki

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 29 '22 15:10 github-actions[bot]

This should be better solved by https://github.com/huggingface/diffusers/pull/3144, BTW.

patrickvonplaten avatar Apr 20 '23 10:04 patrickvonplaten