Strange generation speed
When I use a 30B model in 4-bit with --chat / --cai-chat, my generation speeds are 4-6 times slower, as you can see in the attached screen captures.
Is this a bug or a feature?
My system specs: i7-12700K, 32 GB RAM (also tried 64 GB, but the same issue persists), RTX 3090.
OS: Manjaro Linux, CUDA 11.7.
To sum up: with --chat and --cai-chat it generates at 3 it/s even with an empty context. Without them, it's a stable 20 it/s.
4-bit 30b is slower than 8-bit 13b for me as well. I'm not sure if I ever tested it outside of chat mode. In chat mode, there's a long delay at the start, and then it streams a response at a good speed. The larger the existing prompt, the longer the initial delay. Some others complained about it as well.
Yes, I think I also have this problem - it just doesn't start generating for 5+ seconds, and that's why the generation speeds are so bad. There's no such problem in normal (non --chat/--cai-chat) mode, so I guess it's not a hardware problem.
I tried it briefly in non-chat mode, and for me there's a similar delay there as well. However, since non-chat mode generates tokens until it hits the limit (200 by default), that offsets the initial delay, whereas many of the bot's chat responses are only 3-10 words, so paying the delay for just a few tokens makes it seem a lot slower.
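To put rough, purely illustrative numbers on that amortization effect (the delay and rates below are placeholders in line with the figures reported in this thread, not measurements):

```python
# Illustrative only: how a fixed startup delay drags down short chat replies far
# more than long completions. Numbers are placeholders, not benchmarks.

def effective_rate(startup_delay_s: float, tokens: int, stream_rate_tps: float) -> float:
    """Overall tokens/s once the one-time delay before streaming is included."""
    return tokens / (startup_delay_s + tokens / stream_rate_tps)

print(effective_rate(5.0, 10, 8.0))   # short chat reply: ~1.6 tokens/s
print(effective_rate(5.0, 200, 8.0))  # long non-chat completion: ~6.7 tokens/s
```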
I can also see it being considerably faster in non-chat mode than in chat mode, and it's likely due to this initial delay strongly penalizing short replies, as also mentioned by other users. On an RTX 3090 with the LLaMA-33B model and the same initial 800-token context, short replies in chat mode get generated at around 1 token/s, while long replies in non-chat mode easily get into the 8 tokens/s range if I don't run out of memory.
I can also see this from the GPU power statistics. There's an initial period (20-25 seconds) where power consumption is high, then a second period where it's lower and where the actual token generation seemingly occurs.
EDIT: it looks like this was also mentioned here: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/34
This issue is only present with GPTQ; I compared B&B 8-bit with GPTQ 8-bit, and GPTQ was the only one with a delay.
This seems to be a known issue with GPTQ: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/30
If the B&B 4-bit implementation (coming eventually) doesn't have this delay, it's going to be a huge speedup.
Well, if it's anything like 8-bit, it'll be close to the same speed.
Alright true enough, let me rephrase. It's going to be a huge speedup depending on the use case.
This should be fixed now in GPTQ-for-LLaMa's cuda branch, see https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/30.
Thanks, @aljungberg. I tried it, and it is a lot quicker! There's still more of an initial delay than when using 8-bit 13B, but it's actually faster overall. Though I did set up the webui from scratch after not touching it for weeks, so maybe some other changes played into it, but I doubt it.
Edit: I might have spoken a bit too soon - with larger contexts, it slows down a good bit - I'll have to use it some more before coming to a conclusion here.
It's significantly faster than before for me, like a drop from 22 seconds to ~5 with a long context, generating 50 or so tokens. Although I didn't test that specifically with textgen.
Try enabling the faster kernels too if your HW supports half2. half2 packs two fp16 values together, and some NVIDIA HW can operate on both "halves" in the same amount of time it takes to operate on a single fp16, so this can be a pretty massive speed boost on the arithmetic side. (Unfortunately we might be mostly memory-bandwidth bound in this case though, ATM.) To try it, change faster_kernel to True here: https://github.com/oobabooga/text-generation-webui/blob/b2f356a9ae26efec5121d56d39b17fc4245ba48a/modules/GPTQ_loader.py#L18
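For anyone wanting to try it, the change is just flipping that keyword argument at the linked line. A minimal sketch, assuming a load_quant call roughly like the one in that file (the exact argument list differs between webui and GPTQ-for-LLaMa revisions, so treat the names here as illustrative):

```python
# Sketch of the edit in modules/GPTQ_loader.py; path_to_model, pt_path and shared
# come from the surrounding file, and the real signature may differ in your revision.
model = load_quant(
    str(path_to_model),        # directory with the model's config/tokenizer
    str(pt_path),              # the 4-bit quantized checkpoint
    shared.args.wbits,         # 4 for these models
    faster_kernel=True,        # default is False; enables the half2-based kernels
)
```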
Tried it some more; it is faster than before, though still slower than 13b-8bit with larger contexts. I tried the faster_kernel thing too, but didn't discern a meaningful difference there.
I haven't tried with very high context sizes, as I downloaded the group size one, which OOMs at around ~1700 tokens for me (and others): https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/87
Not sure if this change was supposed to fix that or not.
Overall, my testing can hardly even be called that, so others should test it out for themselves. All I can say is that 13b-8bit is overall faster for me currently with larger contexts. I may have some issues with my setup as well, as I'm using Windows with various workarounds.
No, the change actually uses a small amount of additional memory. Although the faster kernel might claw some of that back since the layer inputs don't have to be temporarily converted from fp16 to fp32. And yes, OOMs around that limit for me too. I did some light profiling of this and the memory spend seems to be on what you'd expect: hidden state and key-value cache. Definitely would be interesting to find ways to reduce this.
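To give a rough sense of scale, here's a back-of-envelope estimate of just the key-value cache (assuming LLaMA-30B/33B's published config of 60 layers and a 6656-dim hidden state, with the cache kept in fp16; quantizing the weights to 4-bit doesn't shrink this part):

```python
# Rough estimate of the fp16 key-value cache at ~1700 tokens of context.
# Assumed model config (LLaMA-30B/33B): 60 layers, hidden size 6656.
n_layers, d_model, bytes_per_value = 60, 6656, 2  # fp16
tokens = 1700

kv_bytes = 2 * n_layers * d_model * bytes_per_value * tokens  # 2 = keys + values
print(f"{kv_bytes / 1024**3:.2f} GiB")  # ~2.53 GiB, before hidden states and activations
```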
I'm guessing you OOM with even less than 1700 tokens with 8 bit weights, right? Otherwise something surprising is going on.
Nah, it only affects this newer 4-bit model, with groupsize=128. It's a known issue: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484509503
I think I meant to link this issue instead of the pull request in my previous post, my bad.
Hmm okay, I'm with you. That comment says MasterTaffer's optimisation is up to 5x as fast but uses 5% more VRAM. That's what I meant when I said that optimisation uses a little more (because it unpacks the weights all in one go rather than doing a streaming unpack), which is going to dig into your context size budget. And then the comment mentions that 4bit ungrouped uses less VRAM, but in both cases we're still talking about 4bit quant on 30b.
I misread your original comment and thought you were saying that in your testing you could have a larger context window with 8bit quant than with 4bit for 30b. But you actually said you prefer 8bit on 13b because it's fast and allows a full context, which is fair enough.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.