
Compel to text

Open · HatmanStack opened this issue 1 year ago · 3 comments

I'm playing around a bit with compel and the HF inference API for long prompts (150+ tokens). One thing the API expects is text as input, so I'm trying to convert the conditioning tensor back into text using cosine similarities between the conditioning vectors and the token embeddings. Am I headed in the right direction, or is this a waste of time? Code:

import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, CLIPTextModel
from compel import Compel

# item.modelID and prompt come from the surrounding request context
tokenizer = AutoTokenizer.from_pretrained(item.modelID, subfolder="tokenizer")
clip = CLIPTextModel.from_pretrained(item.modelID, subfolder="text_encoder")

compel = Compel(tokenizer=tokenizer, text_encoder=clip)
conditioning = compel.build_conditioning_tensor(prompt)

# Vocabulary embedding matrix: [vocab_size, hidden_dim]
token_embeddings = clip.get_input_embeddings().weight
normalized_token_embeddings = normalize(token_embeddings, dim=1)

# Reshape the conditioning tensor to match the shape of the token embeddings
normalized_conditioning = normalize(conditioning.view(-1, normalized_token_embeddings.shape[1]), dim=1)
cosine_similarities = torch.mm(normalized_conditioning, normalized_token_embeddings.t())

# For each conditioning vector, pick the most similar vocabulary token
max_similarity_indices = torch.argmax(cosine_similarities, dim=1)
# Convert the token indices back into text
text = tokenizer.batch_decode(max_similarity_indices.tolist(), skip_special_tokens=True)
promptString = " ".join(text)

HatmanStack · Jun 15, 2024

hmm. not sure exactly what you're trying to achieve but i don't think what you're doing will help - the raw input_embedding matrix isn't useful as-is, it needs to be selectively pushed through the whole CLIP encoder (which is what the token_ids do, they index into the input_embedding matrix)
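
a quick sketch of what i mean, reusing the tokenizer and clip objects from your snippet (illustrative only - the point is that the encoder's output hidden states don't live in the same space as the rows of the input embedding matrix):

import torch

prompt = "a photograph of an astronaut riding a horse"   # any example prompt
token_ids = tokenizer(prompt, return_tensors="pt").input_ids       # [1, seq_len]

# step 1: the token ids just index into the input embedding matrix...
input_embeds = clip.get_input_embeddings()(token_ids)              # [1, seq_len, dim] (768 for SD1.x's CLIP)

# step 2: ...but those rows are then pushed through every transformer layer,
# so the final hidden states are contextual and no longer match any single
# row of the embedding matrix
with torch.no_grad():
    hidden_states = clip(token_ids).last_hidden_state              # [1, seq_len, dim]

# a nearest-neighbour lookup against the embedding matrix would recover the
# prompt from input_embeds, but not from hidden_states (and the hidden states
# are what compel's conditioning tensor contains)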

you might find this interesting though - https://github.com/YuxinWenRick/hard-prompts-made-easy . it's a system for simplifying/adjusting prompts by learning more efficient ways of prompting the same thing - eg you can convert a 75 token prompt to a 20 token prompt that produces a similar CLIP embedding. maybe you can use that to optimize your 150 token prompts down to 75.
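
the core trick there, very roughly (a conceptual sketch of the nearest-token projection idea, not the actual code from that repo):

import torch
import torch.nn.functional as F

# hard-prompt methods keep a small set of *continuous* embeddings and, on each
# optimization step, snap them to the nearest real vocabulary token before
# running the text encoder; the loss against a target CLIP embedding is then
# backpropped into the continuous embeddings (straight-through style).
# this is just the projection step:
def project_to_nearest_tokens(soft_embeds, embedding_matrix):
    # soft_embeds: [num_learned_tokens, dim], embedding_matrix: [vocab_size, dim]
    sims = F.normalize(soft_embeds, dim=1) @ F.normalize(embedding_matrix, dim=1).t()
    token_ids = sims.argmax(dim=1)                   # nearest real token per slot
    return token_ids, embedding_matrix[token_ids]    # hard ids + their embeddings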

damian0815 · Jun 20, 2024

I was stumbling in the dark. The results were lackluster, just a vague resemblance to the original prompt. Which is still kind of amazing, tbh. I thought investing more time might give me some kind of path forward. Your suggestion intuitively seems like it would get better results, although my brain keeps itching with ideas about sentence structure and weighting words like in Compel. Anything to get better results than the garbled mess I was working with. Tokens are fun.

HatmanStack · Jun 20, 2024

right, yeah. part of the problem is that the CLIP text encoder is basically a black box, and the other part is that the >75 token hack is, well, a hack. in my experience you can get just as good "quality" by tweaking a short prompt (eg with a thesaurus website, just try swapping out words for similar words) as by writing a 150 token prompt
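
for context, the >75 token handling is basically chunk-and-concatenate - a rough sketch of the general idea (hypothetical helper, not compel's exact implementation):

import torch

def encode_long_prompt(prompt, tokenizer, text_encoder, chunk_size=75):
    # split the ids into 75-token chunks, wrap each chunk in its own BOS/EOS,
    # encode each chunk separately, then concatenate along the sequence axis
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id
    encoded = []
    with torch.no_grad():
        for chunk in chunks:
            chunk_ids = torch.tensor([[bos] + chunk + [eos]])
            encoded.append(text_encoder(chunk_ids).last_hidden_state)
    # note: each chunk is encoded independently, so tokens in different chunks
    # never attend to each other - which is why this is "a hack"
    return torch.cat(encoded, dim=1)    # [1, total_len, dim]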

damian0815 · Jun 20, 2024