
Speculative decoding?

bryanhpchiang opened this issue Aug 02 '23 · 17 comments

https://github.com/dust-tt/llama-ssp

Any plans to implement speculative decoding? Would probably improve latency by at least 2x and seems not too difficult to implement.
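
For context, the core accept/reject loop is roughly the following. This is a minimal, self-contained sketch of the scheme from the speculative-sampling papers, using toy distributions instead of real models; `draft_dist`, `target_dist`, and the constants are placeholders for illustration, not exllama API:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32   # toy vocabulary size
K = 4        # tokens drafted per verification step


def toy_dist(context, salt):
    """Stand-in for a model's next-token distribution; a real version would
    run a forward pass of the draft or target model on `context`."""
    g = np.random.default_rng(abs(hash((tuple(context), salt))) % (2 ** 32))
    logits = g.standard_normal(VOCAB)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def draft_dist(context):   # small, fast model (e.g. a 1B/3B draft)
    return toy_dist(context, salt=1)


def target_dist(context):  # big, accurate model
    return toy_dist(context, salt=2)


def speculative_step(context):
    """One round: draft K tokens cheaply, then accept/reject them against the
    target model's distributions (Leviathan et al. / Chen et al. scheme)."""
    drafted, q_probs, ctx = [], [], list(context)
    for _ in range(K):
        q = draft_dist(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        q_probs.append(q)
        ctx.append(tok)

    # In a real implementation, the K+1 target distributions below come from
    # a single batched forward pass of the big model over the drafted tokens.
    accepted = list(context)
    for tok, q in zip(drafted, q_probs):
        p = target_dist(accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                    # draft token accepted
        else:
            resid = np.maximum(p - q, 0.0)          # resample from (p - q)+
            accepted.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            return accepted                         # stop at first rejection

    # Every draft accepted: take one bonus token from the target model.
    p = target_dist(accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return accepted


tokens = [1, 2, 3]
for _ in range(3):
    tokens = speculative_step(tokens)
print(tokens)
```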

bryanhpchiang avatar Aug 02 '23 19:08 bryanhpchiang

If I may answer for turboderp: speculative decoding is planned at some point for exllama v2. I am also interested and would really like to implement it myself if turboderp has lots of other stuff to do :)

reference: https://github.com/turboderp/exllama/issues/149#issuecomment-1652408059

SinanAkkoyun avatar Aug 02 '23 23:08 SinanAkkoyun

Thanks for linking! I'm excited.

The main concern I have with speculative decoding is that the latency improvement is bounded by the size of the draft model. Since exllama only seems to support Llama-style architectures, I wonder if there are any ~1B Llama models out there that could be used.

bryanhpchiang avatar Aug 03 '23 22:08 bryanhpchiang

@bryanhpchiang That's what the 3B is for :) In the end, if a 1B model has much worse performance (meaning quality, not speed), the big model will need to reject the speculation all the time, and average speed will end up much worse.
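
As a rough back-of-the-envelope (the acceptance rates and cost ratio below are made-up illustrative numbers, not measurements): if the big model accepts each drafted token with probability α and you draft k tokens per step, the expected number of tokens produced per target-model pass is (1 - α^(k+1)) / (1 - α), and the net speedup also has to pay for the draft model's forward passes.

```python
# Expected tokens emitted per target-model verification pass when k tokens
# are drafted and each is accepted independently with probability alpha,
# plus the resulting net speedup if one draft-model token costs `c` times a
# target-model token (standard speculative-decoding approximation).
def expected_tokens(alpha, k):
    return k + 1 if alpha == 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha, k, c):
    return expected_tokens(alpha, k) / (1 + c * k)


for alpha in (0.3, 0.6, 0.8, 0.9):          # made-up acceptance rates
    print(alpha, round(expected_speedup(alpha, k=4, c=0.05), 2))
```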

SinanAkkoyun avatar Aug 03 '23 22:08 SinanAkkoyun

Makes sense! I think that'd be worth benchmarking: specifically, if you really care about latency, I think it's possible to finetune a 1B model on a specific use case to improve the error rate.

bryanhpchiang avatar Aug 03 '23 22:08 bryanhpchiang

I totally agree; I am also looking for even smaller models for some custom stuff I am working on. Do you find the roughly 220 tokens/second of the 3B model limiting? In the end, the big model seems to make the most difference. For exllama v2 there might also be a significant speed increase, including for the 3B model.

(My only concern with 1B is that there is no pretrained Llama model at that size, iirc.)

SinanAkkoyun avatar Aug 03 '23 22:08 SinanAkkoyun

Great to hear that v2 is an improvement. For my use case, the main metric I care about is time to first token. What does that look like for 3B?

For the last point, I think that's why using non-Llama models like OPT via other libraries like CT2 might make sense.

bryanhpchiang avatar Aug 04 '23 04:08 bryanhpchiang

For my use case, the main metric I care about is time to first token. What does that look like for 3B?

Well, on the 4090 I'm getting about 16,500 tokens/second for 3B. So that's about 120 ms for a 2000-token prompt.

Of course, in speculative sampling you'd also have to do inference on the prompt with the full-scale model.
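
For a rough feel of how that adds up: time to first token is roughly the prompt pass through the draft model plus the prompt pass through the target model. A quick sketch, where the 16,500 t/s figure is the one above and the target-model prompt speed is just a placeholder to plug your own number into:

```python
# Back-of-the-envelope time-to-first-token for a speculative setup: both the
# draft and the target model have to ingest the prompt before anything can
# be emitted.
prompt_tokens = 2000
draft_prompt_tps = 16_500    # 3B prompt processing on a 4090, from above
target_prompt_tps = 5_000    # placeholder: substitute the big model's speed

ttft = prompt_tokens / draft_prompt_tps + prompt_tokens / target_prompt_tps
print(f"~{ttft * 1000:.0f} ms to first token for a {prompt_tokens}-token prompt")
```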

turboderp avatar Aug 04 '23 05:08 turboderp

Just to confirm: 16.5K tok/s for processing the prompt, not sampling?

My use case ideally requires < 50ms until a usable chunk is produced which is why smaller models are appealing. Will run some benchmarks with some other frameworks and let you know how that goes.

bryanhpchiang avatar Aug 05 '23 22:08 bryanhpchiang

https://github.com/turboderp/exllama/blob/master/README.md

There are benchmarks for all models in there; you can see both new-token generation and prompt-processing speeds.

SinanAkkoyun avatar Aug 05 '23 23:08 SinanAkkoyun

Great to hear that v2 is an improvement. For my use case, the main metric I care about is time to first token. What does that look like for 3B? For the last point, I think that's why using non-Llama models like OPT via other libraries like CT2 might make sense.

I haven't found a good 3B model for ExLlama yet. There's open_llama_3b_v2-8k-GPTQ, but it's not actually good, at least not compared to orca-mini. 3B GGML models are rare; 3B GPTQ models for ExLlama seem to be even ~~rarer~~ more rare. I've successfully used "orca-mini-3b.ggmlv3.q4_1.bin" with llamacpp, in case it helps: 70+ tokens per second inference on my notebook's 3060 with 6 GB (fully offloaded to the GPU), CPU set to one thread.

I can look up the prompt-processing t/s if you want, but reaction time is fast.
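
In case it helps anyone reproduce that llama.cpp setup from Python, here is a rough sketch via llama-cpp-python (mid-2023 API; the prompt format and parameter values are assumptions, and the model file is the one named above):

```python
from llama_cpp import Llama

# Load the GGML model fully offloaded to the GPU with a single CPU thread,
# mirroring the setup described above.
llm = Llama(
    model_path="orca-mini-3b.ggmlv3.q4_1.bin",
    n_gpu_layers=100,   # more layers than the model has -> everything on GPU
    n_threads=1,        # one CPU thread
    n_ctx=2048,
)

out = llm(
    "### User:\nExplain speculative decoding in one sentence.\n\n### Response:\n",
    max_tokens=64,
)
print(out["choices"][0]["text"])
```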

SolsticeProjekt avatar Aug 06 '23 08:08 SolsticeProjekt

Here's one. It's the one the results in the readme are based on. Seems to work alright.

turboderp avatar Aug 06 '23 08:08 turboderp

Here's one. It's the one the results in the readme are based on. Seems to work alright.

Thanks. This is the result of test_benchmark_inference using "-p -ppl":

notebook, 5900HS, 3060 6gigs:

First pass:

```
 ** Time, Inference: 0.68 seconds
 ** Speed: 2806.31 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 49.32 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 72.90 tokens/second
 ** VRAM, Inference: [cuda:0] 522.08 MB
 ** VRAM, Total: [cuda:0] 3,128.43 MB
 -- Loading dataset...
 -- Testing 100 chunks..........
 ** Perplexity: 7.8114
```

I don't know what all the passes are there for, but 72.9 t/s is around what I get with llamacpp using the orca-mini 3B. This one performs a lot better in terms of perplexity at 7.81, compared to open_llama_3b_v2-8k-GPTQ at 8.2. Sadly there's no orca-mini 3B GPTQ, except for one called "badtest" on hf, which I won't try for obvious reasons.

Thanks!

(Edit: These OpenLLaMA models pale in comparison to orca-mini, or my prompts are all wrong.)

SolsticeProjekt avatar Aug 06 '23 09:08 SolsticeProjekt

@SolsticeProjekt

https://huggingface.co/SinanAkkoyun/orca_mini_3b_gptq_badtest :)

This is for actual chatting and not a base model. I quantized it myself; that's why it's called badtest, although it performs wonderfully, and in some niche tasks, including following system prompts, it impressed me even more than the 7B chat model in some cases.

SinanAkkoyun avatar Aug 06 '23 12:08 SinanAkkoyun

@SolsticeProjekt

https://huggingface.co/SinanAkkoyun/orca_mini_3b_gptq_badtest :)

This is for actual chatting and not a base model. I quantized it myself; that's why it's called badtest, although it performs wonderfully, and in some niche tasks, including following system prompts, it impressed me even more than the 7B chat model in some cases.

Thanks, I'll give it a go. I'm trying to figure out how to quantize models myself, but this is going really off-topic now ... so thank you, I'll see what it can do. :D

SolsticeProjekt avatar Aug 06 '23 12:08 SolsticeProjekt

I'm trying to figure out how to quantize models myself

Basically, install AutoGPTQ and look at my model README. You can quantize with the other dataset too if you want to; it might be easier.
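
For reference, the AutoGPTQ flow is roughly this (a sketch only, not the exact steps from the model README; the base-model path, calibration text, and quantization settings are placeholders):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "psmathur/orca_mini_3b"     # placeholder: model to quantize
out_dir = "orca_mini_3b_gptq"

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

# A real run would use a few hundred calibration samples (e.g. from c4 or
# wikitext); one toy sentence here just to keep the sketch short.
calib = ["The quick brown fox jumps over the lazy dog."]
examples = [
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
    for enc in (tokenizer(t, return_tensors="pt") for t in calib)
]

quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quant_config)
model.quantize(examples)                       # GPTQ calibration pass
model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)
```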

SinanAkkoyun avatar Aug 06 '23 13:08 SinanAkkoyun

I'm trying to figure out how to quantize models myself

Basically, install AutoGPTQ and look at my model README. You can quantize with the other dataset too if you want to; it might be easier.

Tried that already. It ended up not working, with no output or error message. It looked like it failed to load checkpoints that apparently weren't there, but it should have worked anyway, because someone else used the exact same data.

Yours worked fine, except that it makes the same mistakes as all the others I've tested with exllama. I've learned to use the orca-mini 3B GGML I have as "the bar", because its results were really good and precise. I'm beginning to think the issue comes from exllama and has nothing to do with the models, but I can't cross-compare models in llamacpp, so... there's that, I guess.

Anyhow. vOv

SolsticeProjekt avatar Aug 06 '23 14:08 SolsticeProjekt

@SolsticeProjekt Very interesting. Please tell me more: what exact issue is that, in detail?

SinanAkkoyun avatar Aug 07 '23 00:08 SinanAkkoyun