
Strange generation speed

imagecreation opened this issue on Mar 17, 2023 • 9 comments

When I use a 30B model in 4-bit with --chat / --cai-chat, my generation speeds are 4-6 times slower. You can see it in the screenshots (the first vs. the second). Is it a bug or a feature? My system specs: 12700K, 32 GB RAM (tried with 64 GB, but still the same issue), RTX 3090. OS: Manjaro Linux, CUDA 11.7. To sum up: with --chat and --cai-chat it generates at 3 it/s even with an empty context; without them, a stable 20 it/s.

imagecreation avatar Mar 17 '23 23:03 imagecreation

4-bit 30b is slower than 8-bit 13b for me as well. I'm not sure if I ever tested it outside of chat mode. In chat mode, there's a long delay at the start, and then it streams a response at a good speed. The larger the existing prompt, the longer the initial delay. Some others complained about it as well.

VldmrB avatar Mar 17 '23 23:03 VldmrB

In chat mode, there's a long delay at the start, and then it streams a response at a good speed. The larger the existing prompt, the longer the initial delay. Some others complained about it as well.

Yes. I think I also have this problem - it just doesn't start generating for 5+ seconds, and that's why the generation speeds are so bad. No such problem in normal (not --chat/--cai-chat) mode. So I guess it's not a hardware problem.

ghost avatar Mar 17 '23 23:03 ghost

I tried it briefly in non-chat mode, and for me there's a similar delay there as well. However, since non-chat mode generates tokens until it hits the limit (200 by default), the initial delay gets amortized over many tokens, whereas a chat reply is often only 3-10 words, so the same delay spread over just a few tokens makes it seem a lot slower.
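
To put rough numbers on that reasoning (a back-of-the-envelope sketch, not a measurement; the 5-second delay and 20 tokens/s steady rate are just illustrative values in the ballpark reported in this thread):

```python
# Effective throughput = tokens / (fixed startup delay + tokens / steady rate).
delay_s = 5.0        # illustrative per-request startup delay
steady_rate = 20.0   # illustrative tokens/s once generation is streaming

for n_tokens in (10, 200):  # short chat reply vs. the default non-chat limit
    effective = n_tokens / (delay_s + n_tokens / steady_rate)
    print(f"{n_tokens:>3} tokens -> {effective:.1f} tokens/s effective")

# 10 tokens -> 1.8 tokens/s effective
# 200 tokens -> 13.3 tokens/s effective
```

So the same fixed delay drags a short chat reply down to ~2 tokens/s while a full-length completion still looks reasonably fast.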

VldmrB avatar Mar 18 '23 00:03 VldmrB

I can also see generation being considerably faster in non-chat mode than in chat mode, and it's likely due to this initial delay strongly penalizing short replies, as other users have also mentioned. On an RTX 3090 with the LLaMA-33B model and the same initial 800-token context, short replies in chat mode get generated at around 1 token/s, while long replies in non-chat mode easily get into the 8 tokens/s range if I don't run out of memory.

I can also see this from GPU power statistics. There's an initial period (20-25 seconds) where power consumption is high, then another where it's lower and seemingly token generation actually occurs.

EDIT: it looks like this was also mentioned here: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/34

BugReporterZ avatar Mar 18 '23 11:03 BugReporterZ

This issue is only present with GPTQ; I compared B&B 8-bit with GPTQ 8-bit, and GPTQ was the only one with a delay.

USBhost avatar Mar 20 '23 03:03 USBhost

This seems to be a known issue with GPTQ: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/30


rohvani avatar Mar 20 '23 03:03 rohvani

This issue is only present with GPTQ; I compared B&B 8-bit with GPTQ 8-bit, and GPTQ was the only one with a delay.

If the B&B 4-bit implementation (coming eventually) doesn't have this delay, it's going to be a huge speedup.

musicurgy avatar Mar 25 '23 13:03 musicurgy

This issue is only present with GPTQ; I compared B&B 8-bit with GPTQ 8-bit, and GPTQ was the only one with a delay.

If the B&B 4-bit implementation (coming eventually) doesn't have this delay, it's going to be a huge speedup.

Well, if it's anything like 8-bit, it'll be close to the same speed.

USBhost avatar Mar 25 '23 13:03 USBhost

This issue is only present with GPTQ; I compared B&B 8-bit with GPTQ 8-bit, and GPTQ was the only one with a delay.

If the B&B 4-bit implementation (coming eventually) doesn't have this delay, it's going to be a huge speedup.

Well, if it's anything like 8-bit, it'll be close to the same speed.

Alright, true enough, let me rephrase: it's going to be a huge speedup, depending on the use case.

musicurgy avatar Mar 25 '23 14:03 musicurgy

This should be fixed now in GPTQ-for-LLaMa's cuda branch; see https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/30.

aljungberg avatar Mar 29 '23 09:03 aljungberg

Thanks, @aljungberg. I tried it, and it is a lot quicker! There's still more of an initial delay than when using 8-bit 13B, but it's actually faster overall. Though I did set up the webui from scratch after not touching it for weeks, so maybe some other changes played into it, but I doubt it.

Edit: I might have spoken a bit too soon - with larger contexts, it slows down a good bit - I'll have to use it some more before coming to a conclusion here.

VldmrB avatar Mar 29 '23 19:03 VldmrB

It's significantly faster than before for me, like a drop from 22 seconds to ~5 with a long context, generating 50 or so tokens - although I didn't test that specifically with textgen.

Try enabling the faster kernels too if your HW supports half2. half2 packs two fp16 values together, and some NVIDIA hardware can operate on both "halves" in the same amount of time it takes to operate on a single fp16, so this can be a pretty massive speed boost on the arithmetic side. (Unfortunately, we might be mostly memory-bandwidth bound in this case at the moment, though.)

To try it, change faster_kernel to True here: https://github.com/oobabooga/text-generation-webui/blob/b2f356a9ae26efec5121d56d39b17fc4245ba48a/modules/GPTQ_loader.py#L18
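
If you're not sure whether your card supports half2 at all, here's a quick standalone check to run before flipping that flag (not part of the webui; half2/fp16 arithmetic needs CUDA compute capability 5.3 or higher, and an RTX 3090 reports 8.6):

```python
import torch

# Print the compute capability of the first GPU and whether the
# half2/fp16 kernels are worth enabling on it.
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (5, 3):
    print(f"Compute capability {major}.{minor}: half2-capable, try faster_kernel=True")
else:
    print(f"Compute capability {major}.{minor}: no fp16 arithmetic, leave faster_kernel=False")
```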

aljungberg avatar Mar 29 '23 20:03 aljungberg

Tried it some more: it is faster than before, though still slower than 13B 8-bit with larger contexts. I tried the faster_kernel option too, but didn't discern a meaningful difference there. I haven't tried very high context sizes, as I downloaded the group-size variant, which OOMs at around ~1700 tokens for me (and others): https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/87. Not sure if this change was supposed to fix that or not.

Overall, my testing can hardly even be called that, so others should test it out for themselves. All I can say is that 13B 8-bit is overall faster for me currently with larger contexts. I may have some issues with my setup as well, as I'm using Windows with various workarounds.

VldmrB avatar Mar 29 '23 23:03 VldmrB

I haven't tried very high context sizes, as I downloaded the group-size variant, which OOMs at around ~1700 tokens for me (and others): qwopqwop200/GPTQ-for-LLaMa#87. Not sure if this change was supposed to fix that or not.

No, the change actually uses a small amount of additional memory, although the faster kernel might claw some of that back, since the layer inputs don't have to be temporarily converted from fp16 to fp32. And yes, it OOMs around that limit for me too. I did some light profiling of this, and the memory seems to be spent on what you'd expect: the hidden state and the key-value cache. It would definitely be interesting to find ways to reduce this.
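
For a rough sense of where that memory goes, here's a back-of-the-envelope estimate of the fp16 key-value cache alone, using the commonly cited LLaMA-30B/33B dimensions (60 layers, hidden size 6656); it ignores the hidden state, activation buffers, and allocator overhead, so treat it as a lower bound:

```python
# fp16 KV cache: 2 tensors (K and V) x layers x hidden_size x 2 bytes, per token.
n_layers, hidden_size, bytes_fp16 = 60, 6656, 2   # LLaMA-30B/33B dimensions
tokens = 1700                                     # roughly where the OOMs hit

per_token = 2 * n_layers * hidden_size * bytes_fp16
total = per_token * tokens
print(f"{per_token / 2**20:.2f} MiB per token, ~{total / 2**30:.1f} GiB at {tokens} tokens")
# ~1.52 MiB per token, ~2.5 GiB at 1700 tokens
```

That's a sizeable slice of whatever is left on a 24 GB card once the roughly 16-17 GB of 4-bit weights are loaded, which lines up with running out of memory somewhere around that context length.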

Overall, my testing can hardly even be called that, so others should test it out for themselves. All I can say is that 13B 8-bit is overall faster for me currently with larger contexts. I may have some issues with my setup as well, as I'm using Windows with various workarounds.

I'm guessing you OOM with even less than 1700 tokens with 8 bit weights, right? Otherwise something surprising is going on.

aljungberg avatar Mar 30 '23 12:03 aljungberg

Nah, it only affects this newer 4-bit model, with groupsize=128. It's a known issue: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484509503

I think I meant to link this issue instead of the pull request in my previous post, my bad.

VldmrB avatar Mar 31 '23 01:03 VldmrB

Nah, it only affects this newer 4-bit model, with groupsize=128. It's a known issue: #530 (comment)

I think I meant to link this issue instead of the pull request in my previous post, my bad.

Hmm okay, I'm with you. That comment says MasterTaffer's optimisation is up to 5x as fast but uses 5% more VRAM. That's what I meant when I said that optimisation uses a little more memory (because it unpacks the weights all in one go rather than doing a streaming unpack), which is going to dig into your context-size budget. And then the comment mentions that 4-bit ungrouped uses less VRAM, but in both cases we're still talking about 4-bit quantization on 30B.
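
To make the "all in one go vs. streaming unpack" distinction concrete, here's a toy NumPy sketch of the trade-off. It has nothing to do with the actual GPTQ CUDA kernels (the packing layout, scaling, and block size are all simplified); it just shows why the eager path needs a bigger temporary buffer:

```python
import numpy as np

# Toy int4 weight matrix packed two nibbles per byte, with a single fp16 scale.
rows, cols = 4096, 4096
packed = np.random.randint(0, 256, size=(rows, cols // 2), dtype=np.uint8)
scale = np.float16(0.01)
x = np.random.randn(cols).astype(np.float16)

def unpack(block):
    """Dequantize a block of packed rows to fp16 (simplified layout)."""
    lo = (block & 0x0F).astype(np.float16)
    hi = (block >> 4).astype(np.float16)
    return np.concatenate([lo, hi], axis=1) * scale

# Eager: a full-size fp16 temporary (rows * cols * 2 bytes) exists while the matmul runs.
y_eager = unpack(packed) @ x

# Streaming: dequantize and consume one row block at a time; the peak temporary is only
# block_rows * cols * 2 bytes, at the cost of more, smaller operations.
block_rows = 256
y_stream = np.concatenate([unpack(packed[i:i + block_rows]) @ x
                           for i in range(0, rows, block_rows)])

assert np.allclose(y_eager, y_stream, rtol=1e-2, atol=1e-2)  # same result either way
```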

I misread your original comment and thought you were saying that in your testing you could have a larger context window with 8-bit quantization than with 4-bit for 30B. But you actually said you prefer 8-bit on 13B because it's fast and allows a full context, which is fair enough.

aljungberg avatar Mar 31 '23 08:03 aljungberg

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Apr 30 '23 23:04 github-actions[bot]