InvokeAI
[enhancement]: 77 maximum tokens?
Is there an existing issue for this?
- [X] I have searched the existing issues
Contact Details
No response
What should this feature add?
Hello, is there any specific reason why there is a hard cap on the maximum number of tokens?
merge_embeddings.py line 28: max_length=77
One would think tokens would be discarded based on relevance, but they are simply trimmed after the 77th token.
Alternatives
No response
Additional Content
No response
CLIP does not support more than that.
The limit has been lifted here:
I read somewhere that this change has since been reverted for unknown reasons; it might be worth investigating.
This likely won't be ported due to political/PR reasons - The CLIP token limit removal is based on leaked NovelAI knowledge.
Yikes, I didn't know that. Well, that can't erase the fact that it is possible. Having max tokens capped to 77 due to political/PR reasons seems to me like a crime against progress.
Could have sworn that this was a toggle on A1111 from before the NovelAI situation... hmm
99% sure that's not the case, it was added just days after the leak, if not on the same
Yeah, the dates on the commit match. My mistake
So... unfortunate circumstances for this particular case aside, could we skip to the part where either this feature would be a great addition, or there is no way to implement it without asking for trouble?
I had a look at the implementation here: https://github.com/huggingface/diffusers/blob/main/examples/community/lpw_stable_diffusion.py#L209
My initial take after reading this: it's something that wouldn't be too difficult to add. With or without diffusers makes little difference; everyone is using transformers.CLIPTextModel to turn text into CLIP embeddings, and all the weighting and blending InvokeAI does means it is already messing with that stuff anyway.
It is not quite as simple as pretending the token limit doesn't exist. You can still only send that many tokens through CLIPTextModel at one time, but apparently, if you feel like it, you can do that a couple of times, concatenate the results, and send the whole mess to the diffusion model.
I think the biggest unknowns will be:
- Like giving the diffusion model latents that aren't 512 square (encoded to 64), if you change the dimensions on it, it might do weird shit. In this case we're changing the dimensions of the embeddings instead of the latents, to something different from what the model was trained on.
- When preparing those embeddings in chunks, it might matter how you divide up those chunks. The most obvious case is a multi-token word: I have a strong suspicion that it won't be represented very well if it gets chopped in half, with the first syllable at the end of one embedding and the rest of it in the next (a rough sketch of word-boundary chunking follows below).
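For illustration only, word-boundary chunking could look roughly like this. This is a sketch, not InvokeAI or diffusers code; chunk_prompt_by_words is a made-up helper name, and it assumes the SD 1.x CLIP tokenizer:

```python
# Hypothetical sketch: split a long prompt into <=75-token chunks at word
# boundaries so a multi-token word never straddles two chunks.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def chunk_prompt_by_words(prompt: str, chunk_size: int = 75) -> list[list[int]]:
    chunks: list[list[int]] = []
    current: list[int] = []
    for word in prompt.split():
        # Tokenize each word on its own and strip the BOS/EOS the tokenizer adds.
        # CLIP's BPE marks word endings, so the ids for a whitespace-separated
        # word generally match what it would get inside the full prompt.
        word_ids = tokenizer(word).input_ids[1:-1]
        # Start a new chunk rather than splitting this word across two chunks.
        # (A single word longer than chunk_size is not handled in this sketch.)
        if current and len(current) + len(word_ids) > chunk_size:
            chunks.append(current)
            current = []
        current += word_ids
    if current:
        chunks.append(current)
    return chunks
```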
Personal recommendation: this isn't something that should be done transparently by default.
First, focus on some UI features to make it more visible how InvokeAI is interpreting your prompt, e.g. what the positive and negative prompts are (conditioned and unconditioned), whether it is splitting and blending multiple prompts, etc. This information is currently available in a very dense, syntax-heavy format if you use log-tokenization in the TUI, but it is invisible in the Web UI, and you only see it after generation has started.
Then we can build on that and show whether your long prompt is split into multiple embeddings, where it is split, etc.
Would love to hear ideas on how we move closer to this, and what the implications are for the prompt. We're contemplating the prompt-crafting UX and think this is going to have to play mightily into it.
NovelAI stole from SD first, so why do we even care? The cat is out of the bag anyway.
SD is Open Source, you can't steal from it. NovelAI is not.
Is open-source allowed to be used for profit though? I recall many licenses prohibit that kind of thing.
Yes. For example, InvokeAI is MIT licensed and can be (and is) used to build for-profit commercial products.
Given that NAI themselves describe their technique in a public blog post, it should be entirely possible to implement without referencing any "tainted" code, making the political/PR reasons argument moot.
@feffy380 they don’t seem to provide any code?
At this point, we're sufficiently convinced it would be OK to implement so long as we have something TO reference in implementing (that isn't auto's code).
and all the weighting and blending InvokeAI does means it is already messing with that stuff anyway.
Not liking the sound of that. I was under the impression that, given the same parameters, Automatic1111 and InvokeAI (or any other SD fork for that matter) would spit out the same result/identical image. Apparently the Frankenstein-ism began a long time ago.
The limit-removal code in the AUTOMATIC1111 webui repository has changed from what @Neosettler linked at the top, and I'm pretty sure it's original now.
Link to code or commit?
https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/706d5944a075a6523ea7f00165d630efc085ca22 @n00mkrad - I think
This limit is what keeps me from using InvokeAI more often. On my local system, InvokeAI runs much better with low VRAM, making it a great choice, and the web UI of this project is beautifully modern. But not being able to be more creative, or to transfer prompts from the Automatic1111 web UI, is a real bummer.
From my own experience I can say the increased token limit does a lot. Notably, the tokens before the official limit have a much heavier influence in the Automatic1111 web UI, but later tokens are still considered, making it easier to be creative with more words. After roughly 4 * 77 tokens, though, there is no further effect in Automatic1111's implementation, so the limit cannot be stretched endlessly.
A cursory look suggests that the later actual implementation of unlimited tokens via concatenation was here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/2138 Not sure if it has changed since then.
Their wiki describes the unlimited token feature here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#infinite-prompt-length
Typing past the standard 75 tokens that Stable Diffusion usually accepts increases the prompt size limit from 75 to 150. Typing past that increases the prompt size further. This is done by breaking the prompt into chunks of 75 tokens, processing each independently using CLIP's Transformers neural network, and then concatenating the results before feeding them into the next component of Stable Diffusion, the Unet.
For example, a prompt with 120 tokens would be separated into two chunks: the first with 75 tokens, the second with 45. Both would be padded to 75 tokens and extended with start/end tokens to 77. After passing those two chunks through CLIP, we'll have two tensors with shape (1, 77, 768). Concatenating those results in a (1, 154, 768) tensor that is then passed to the Unet without issue.
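Roughly, that chunk-and-concatenate procedure could be sketched like this. This is not AUTOMATIC1111's actual code; encode_long_prompt is a made-up helper, the SD 1.x CLIP text encoder is assumed, and padding with the end-of-text token is an assumption:

```python
# Rough sketch: encode a long prompt in 75-token chunks and concatenate the
# resulting CLIP embeddings along the sequence dimension.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_long_prompt(prompt: str, chunk_size: int = 75) -> torch.Tensor:
    # Tokenize without truncation and strip the BOS/EOS the tokenizer adds;
    # each chunk gets its own start/end tokens below.
    ids = tokenizer(prompt, truncation=False).input_ids[1:-1]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

    embeddings = []
    for chunk in chunks:
        # Pad the last chunk up to 75 tokens, then wrap with start/end tokens
        # so each chunk is the 77 tokens CLIP expects.
        padded = chunk + [tokenizer.eos_token_id] * (chunk_size - len(chunk))
        padded = [tokenizer.bos_token_id] + padded + [tokenizer.eos_token_id]
        with torch.no_grad():
            out = text_encoder(torch.tensor([padded])).last_hidden_state  # (1, 77, 768)
        embeddings.append(out)

    # Two chunks -> (1, 154, 768); the Unet's cross-attention accepts the longer sequence.
    return torch.cat(embeddings, dim=1)
```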
It does sound similar to what NovelAI described in the blog post linked by feffy380 (https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac)
InvokeAI refuses to implement it because it's based on leaked NAI info.
After playing around with both InvokeAI and Automatic1111's web UI, I can say that the missing ability to handle longer prompts is the one major thing that prevents InvokeAI from being usable for me. Without that feature, I cannot properly get any work done in it. Too many of my prompts simply require more than 75 tokens, especially when using models that expect comma-separated tokens (as each comma is itself a token).
I tried manually specifying prompt blending myself, but the results I'm getting are simply not good. For whatever reason it seems like large portions of my prompt are simply being ignored. And while there may be something I'm doing wrong that I could correct, the cognitive load of having to manually break apart and maintain my prompt in different sections is simply not something I want to nor should I have to deal with. Automatic1111's web UI simply handles it for me and does a great job. It's so effective that I think such a feature should be considered standard and a must-have for a proper stable diffusion interface. And from what I can tell, there are implementations of it now that don't directly stem from the NovelAI leak so I see no reason not to include it.
It's a shame, too, because I vastly prefer the UI and workflow of InvokeAI. Its inpainting feature is killer. If it simply had the ability to use prompts longer than 75 tokens, I would ditch the Automatic1111 UI in a heartbeat.
@damian0815 - Acknowledging that it’s effectively just a blend, what are your thoughts on alerting the user of truncated tokens and allowing them to run it with an “auto-blended” form of the prompt? I don’t want to become so opaque as to obscure what’s happening because I think that the blend approach is way less than ideal, but the “average” user may benefit
cc: @psychedelicious @blessedcoolant
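For illustration, an "auto-blended" fallback could look roughly like this. This is a sketch under assumptions: auto_blend is a made-up helper, not InvokeAI's actual blend code, and a simple weighted average of per-chunk conditioning tensors stands in for whatever weighting we would actually use:

```python
# Hypothetical sketch: encode each <=75-token chunk separately and take a
# weighted average of the conditioning tensors, so the result keeps the
# normal (1, 77, 768) shape instead of growing like concatenation does.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def auto_blend(chunks: list[str], weights: list[float] | None = None) -> torch.Tensor:
    weights = weights or [1.0] * len(chunks)
    blended = None
    for chunk, weight in zip(chunks, weights):
        tokens = tokenizer(chunk, padding="max_length", max_length=77,
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            emb = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768)
        blended = emb * weight if blended is None else blended + emb * weight
    # Normalize by total weight; output shape is still (1, 77, 768).
    return blended / sum(weights)
```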
Agreeing with @briankendall: allowing more than 77 tokens seems to be a critical move for users.
InvokeAI refuses to implement it because it's based on leaked NAI info.
Sounds noble, but does it have real weight in the big scheme of things? Find a way to make it legit and/or seek advice from a lawyer if it helps you sleep better.
I was just stating the facts. I don't give a shit whether it was stolen info or not; I want this feature as much as everyone else here.
It's not like InvokeAI doesn't want to support this; it's that the code was apparently lifted from a codebase whose license doesn't allow us to use it. At the very least, the legal side of this is vague and risky.
Also, we have been busy with a lot of other things and haven't given this the revisit it deserves.
After reading @briankendall 's feedback, it's clear that the token limit is a major hindrance to productivity.
Can anybody point us to an implementation which is from a properly licensed open source project?
Flagging @lstein for input on this.
Sounds noble, but does it have real weight in the big scheme of things? Find a way to make it legit and/or seek advice from a lawyer if it helps you sleep better.
To be clear, it sounds like you're asking the question "Do ethical decisions have real weight in the grand scheme of things?" In case this is news - yes, they do. I'm as pragmatic a person as you can find, but even I recognize the difference between dealing with practical realities and excusing immoral behavior. There's no reason to lift stolen code when we can come up with something that works as well, or better.
Can anybody point us to an implementation which is from a properly licensed open source project?
@psychedelicious - I think we can work to implement our own, using blends and pads. It may be an experiment/science project, but at the very least we can do "something" with a prompt that would otherwise be truncated.