
[enhancement]: 77 maximum tokens?

Open · Neosettler opened this issue 2 years ago · 33 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Contact Details

No response

What should this feature add?

Hello, is there any specific reason why there is a hard cap on the maximum number of tokens?

`merge_embeddings.py`, line 28: `max_length=77`

One would think tokens would be discarded based on relevance but they are simply trimmed after the 77th token.
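
For illustration, a minimal sketch of that truncation behaviour (the model id and the direct use of transformers here are assumptions for the example, not taken from InvokeAI's code):

```python
# Minimal sketch of the hard cap, assuming the usual SD 1.x text encoder.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

long_prompt = "a highly detailed painting of " * 30  # well past the limit

ids = tokenizer(
    long_prompt,
    truncation=True,   # everything past max_length is silently dropped
    max_length=77,     # the cap referenced in merge_embeddings.py
    return_tensors="pt",
).input_ids

print(ids.shape)  # torch.Size([1, 77]) -- later tokens are simply trimmed, not ranked
```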

Alternatives

No response

Additional Content

No response

Neosettler avatar Nov 23 '22 23:11 Neosettler

CLIP does not support more than that.

n00mkrad avatar Nov 24 '22 07:11 n00mkrad

The limit has been lifted here:

max_position_embeddings (int, optional, defaults to 77) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

I read somewhere that this change has been reverted for unknown reasons; it might be worth investigating.
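
For what it's worth, that 77 comes straight from the text encoder's config. A quick way to check (model id assumed for illustration; note the released position embeddings were only trained for 77 positions, so raising the number alone doesn't give a usable longer context):

```python
# Sketch: inspect where the 77 limit lives (model id assumed for illustration).
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
print(text_encoder.config.max_position_embeddings)  # 77 for the released CLIP weights
```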

Neosettler avatar Nov 24 '22 19:11 Neosettler

This likely won't be ported due to political/PR reasons - The CLIP token limit removal is based on leaked NovelAI knowledge.

n00mkrad avatar Nov 24 '22 19:11 n00mkrad

Yikes, I didn't know that. Well, that can't erase the fact that it is possible. Having max tokens capped at 77 due to political/PR reasons seems to me like a crime against progress.

Neosettler avatar Nov 25 '22 21:11 Neosettler

Could have sworn that this was a toggle on A1111 from before the NovelAI situation... hmm

psychedelicious avatar Nov 26 '22 07:11 psychedelicious

Could have sworn that this was a toggle on A1111 from before the NovelAI situation... hmm

99% sure that's not the case, it was added just days after the leak, if not on the same day

n00mkrad avatar Nov 26 '22 14:11 n00mkrad

99% sure that's not the case, it was added just days after the leak, if not on the same day

Yeah, the dates on the commit match. My mistake.

psychedelicious avatar Nov 26 '22 20:11 psychedelicious

So... unfortunate circumstances for this particular case aside, could we skip to the part where we decide whether this feature would be a great addition, or whether there is no way to implement it without asking for trouble?

Neosettler avatar Nov 26 '22 20:11 Neosettler

I had a look at the implementation here: https://github.com/huggingface/diffusers/blob/main/examples/community/lpw_stable_diffusion.py#L209

My initial take after reading this is: this is something that wouldn't be too difficult to add. With or without diffusers doesn't make much difference: everyone is using transformers.CLIPTextModel to turn text into CLIP embeddings, and all the weighting and blending InvokeAI does means it is already messing with that stuff anyway.

It is not quite as simple as pretending the token limit doesn't exist. You can still only send that many through CLIPTextModel at one time. But apparently, if you feel like it, you can do that a couple of times, concatenate the results, and send the whole mess to the diffusion model?
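
Roughly, something like the sketch below (naive fixed-boundary chunking; the model id, padding strategy, and helper name are assumptions for illustration, and the multi-token-word caveat listed right after this still applies):

```python
# Naive sketch of "encode in 77-token chunks and concatenate" (illustrative only).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

MODEL = "openai/clip-vit-large-patch14"  # assumed SD 1.x text encoder
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModel.from_pretrained(MODEL)

def encode_long_prompt(prompt: str, chunk_size: int = 75) -> torch.Tensor:
    # Tokenize without truncation and without BOS/EOS...
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id

    pieces = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # ...then re-add BOS and pad with EOS so each chunk is exactly 77 ids.
        chunk = [bos] + chunk + [eos] * (76 - len(chunk))
        with torch.no_grad():
            out = text_encoder(input_ids=torch.tensor([chunk]))
        pieces.append(out.last_hidden_state)   # each piece: (1, 77, 768)

    return torch.cat(pieces, dim=1)            # (1, 77 * n_chunks, 768)
```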

I think the biggest unknowns will be:

  • Like giving the diffusion model latents that aren't 512 (encoded to 64) square: if you change the dimensions on it, it might do weird shit. In this case we're changing the dimensions of the embeddings instead of the latents, to something different than what it was trained on.
  • When preparing those embeddings in chunks, it might matter how you divide up those chunks. The most obvious case is a multi-token word: I have a strong suspicion it won't be represented very well if it gets chopped in half, with the first syllable at the end of one embedding and the rest in the next.

keturn avatar Dec 02 '22 17:12 keturn

Personal recommendation: this isn't something that should be done transparently by default.

First, focus on some UI features to make it more visible how InvokeAI is interpreting your prompt: e.g. what the positive and negative prompts are (conditioning and unconditioned), whether it is splitting and blending multiple prompts, etc. This information is currently available in a very dense, syntax-heavy format if you use log-tokenization in the TUI, but it is invisible in the Web UI, and you only know about it after you start things generating.

Then we can build on that and show if your long prompt is split into multiple embeddings, where it is split, etc.

keturn avatar Dec 02 '22 17:12 keturn

Would love to hear ideas on how we move closer to this, and what the implications are for the prompt. We're contemplating the prompt crafting UX and think this is going to have to play mightily into it.

hipsterusername avatar Dec 04 '22 22:12 hipsterusername

NovelAI stole from SD first, so why do we even care? The cat is out of the bag anyway.

Seedmanc avatar Dec 13 '22 09:12 Seedmanc

NovelAI stole from SD first, so why do we even care? The cat is out of the bag anyway.

SD is Open Source, you can't steal from it. NovelAI is not.

n00mkrad avatar Dec 13 '22 10:12 n00mkrad

Is open source allowed to be used for profit, though? I recall many licenses prohibit that kind of thing.

Seedmanc avatar Dec 13 '22 11:12 Seedmanc

Yes. For example, InvokeAI is MIT licensed and can be (and is) used to build for-profit commercial products.

hipsterusername avatar Dec 13 '22 12:12 hipsterusername

Given that NAI themselves describe their technique in a public blog post, it should be entirely possible to implement without referencing any "tainted" code, making the political/PR reasons argument moot.

feffy380 avatar Dec 16 '22 12:12 feffy380

@feffy380 they don’t seem to provide any code?

At this point, we're sufficiently convinced it would be OK to implement, so long as we have something TO reference in implementing (that isn't auto's code).

hipsterusername avatar Dec 16 '22 12:12 hipsterusername

and all the weighting and blending InvokeAI does means it is already messing with that stuff anyway.

Not liking the sound of that. I was under the impression that given the same parameters, Automatic1111 and InvokeAI (or any other SD fork for that matter) would spit out the same result/identical image. Apparently the Frankenstein-ism began a long time ago.

Neosettler avatar Dec 16 '22 22:12 Neosettler

The limit-removal code in the AUTOMATIC1111 webui repository has changed from what @Neosettler linked at the top, and I'm pretty sure it's original now.

hyperjesus88 avatar Dec 20 '22 22:12 hyperjesus88

The limit-removal code in the AUTOMATIC1111 webui repository has changed from what @Neosettler linked at the top, and I'm pretty sure it's original now.

Link to code or commit?

n00mkrad avatar Dec 21 '22 03:12 n00mkrad

https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/706d5944a075a6523ea7f00165d630efc085ca22 @n00mkrad - I think

donaldanixon avatar Feb 02 '23 09:02 donaldanixon

This limit is what keeps me from using InvokeAI more often. On my local system, InvokeAI runs much better with low VRAM, making it a great choice. The web UI of this project is also so beautifully modern. But not being able to be more creative, or to transfer prompts from the Automatic1111 web UI, is a real bummer.

From my own experience I can say the increased tokens do a lot. Notably, the tokens before the official limit have a much heavier influence in the Automatic1111 web UI, but later tokens are still considered, making it easier to be creative with more words. After roughly 4 * 77 tokens of prompt length, though, there is no further effect in Automatic1111's implementation, so the limit cannot be stretched endlessly.

Testertime avatar Feb 04 '23 04:02 Testertime

A cursory look suggests that the later actual implementation of unlimited tokens via concatenation was here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/2138 Not sure if it has changed since then.

Their wiki describes the unlimited token feature here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#infinite-prompt-length

Typing past standard 75 tokens that Stable Diffusion usually accepts increases prompt size limit from 75 to 150. Typing past that increases prompt size further. This is done by breaking the prompt into chunks of 75 tokens, processing each independently using CLIP's Transformers neural network, and then concatenating the result before feeding into the next component of stable diffusion, the Unet.

For example, a prompt with 120 tokens would be separated into two chunks: first with 75 tokens, second with 45. Both would be padded to 75 tokens and extended with start/end tokens to 77. After passing those two chunks through CLIP, we'll have two tensors with shape of (1, 77, 768). Concatenating those results in a (1, 154, 768) tensor that is then passed to the Unet without issue.
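
As a sanity check of the "passed to Unet without issue" part: cross-attention doesn't care about the text sequence length, so a longer context does go through. A sketch with an assumed model id and the usual SD 1.x shapes:

```python
# Sketch: the SD 1.x UNet accepts a 154-token text context via cross-attention.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"  # model id assumed for illustration
)

latents = torch.randn(1, 4, 64, 64)       # latents for a 512x512 image
timestep = torch.tensor([10])
text_context = torch.randn(1, 154, 768)   # two concatenated 77-token chunks

with torch.no_grad():
    out = unet(latents, timestep, encoder_hidden_states=text_context)

print(out.sample.shape)  # torch.Size([1, 4, 64, 64])
```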

It does sound similar to what NovelAI described in the blog post linked by feffy380 (https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac)

mirrexagon avatar Feb 04 '23 09:02 mirrexagon

InvokeAI refuses to implement it because it's based on leaked NAI info.

n00mkrad avatar Feb 04 '23 17:02 n00mkrad

After playing around with both InvokeAI and Automatic1111's web UI, I can say that the missing ability to handle longer prompts is the one major thing that prevents InvokeAI from being usable for me. Without that feature, I cannot properly get any work done in it. Too many of my prompts simply require more than 75 tokens, especially when using models that require comma-separated tokens (as each comma itself is a token).

I tried manually specifying prompt blending myself, but the results I'm getting are simply not good. For whatever reason it seems like large portions of my prompt are simply being ignored. And while there may be something I'm doing wrong that I could correct, the cognitive load of having to manually break apart and maintain my prompt in different sections is simply not something I want to, nor should have to, deal with. Automatic1111's web UI simply handles it for me and does a great job. It's so effective that I think such a feature should be considered standard and a must-have for a proper stable diffusion interface. And from what I can tell, there are implementations of it now that don't directly stem from the NovelAI leak, so I see no reason not to include it.

It's a shame, too, because I vastly prefer the UI and workflow of InvokeAI. Its inpainting feature is killer. If it simply had the ability to use prompts longer than 75 tokens, I would ditch the Automatic1111 UI in a heartbeat.

briankendall avatar Feb 09 '23 17:02 briankendall

@damian0815 - Acknowledging that it’s effectively just a blend, what are your thoughts on alerting the user to truncated tokens and allowing them to run it with an “auto-blended” form of the prompt? I don’t want to become so opaque as to obscure what’s happening, because I think the blend approach is way less than ideal, but the “average” user may benefit.

cc: @psychedelicious @blessedcoolant

hipsterusername avatar Feb 09 '23 17:02 hipsterusername

Agreeing with @briankendall: allowing more than 77 tokens seems to be a critical move to make for users.

InvokeAI refuses to implement it because it's based on leaked NAI info.

Sounds noble, but does it have real weight in the big scheme of things? Find a way to make it legit and/or seek advice from a legal standpoint if it helps you sleep better.

Neosettler avatar Feb 09 '23 17:02 Neosettler

Agreeing with @briankendall: allowing more than 77 tokens seems to be a critical move to make for users.

InvokeAI refuses to implement it because it's based on leaked NAI info.

Sounds noble, but does it have real weight in the big scheme of things? Find a way to make it legit and/or seek advice from a legal standpoint if it helps you sleep better.

I was just stating the facts; I don't give a shit if it was stolen info or not. I want this feature as much as everyone else in here.

n00mkrad avatar Feb 09 '23 20:02 n00mkrad

It's not that InvokeAI doesn't want to support this; it's that the code was apparently lifted from a code base whose license doesn't allow us to use it. At the very least, the legal side of this is vague and risky.

Also, we have been busy with a lot of other things and haven't given this the revisit it deserves.

After reading @briankendall's feedback, it's clear that the token limit is a major hindrance to productivity.

Can anybody point us to an implementation which is from a properly licensed open source project?

Flagging @lstein for input on this.

psychedelicious avatar Feb 09 '23 20:02 psychedelicious

Sounds noble, but does it have real weight in the big scheme of things? Find a way to make it legit and/or seek advice from a legal standpoint if it helps you sleep better.

To be clear, it sounds like you're asking the question "Do ethical decisions have real weight in the grand scheme of things?" In case this is news: yes, they do. I'm as pragmatic a person as you can find, but even I recognize the difference between dealing with practical realities and excusing immoral behavior. There's no reason to lift stolen code when we can come up with something that works as well, or better.

Can anybody point us to an implementation which is from a properly licensed open source project?

@psychedelicious - I think we can work to implement our own, using blends and pads. It may be an experiment/science project, but at the very least we can do "something" with a prompt that would otherwise be truncated.

hipsterusername avatar Feb 09 '23 21:02 hipsterusername