llama.cpp icon indicating copy to clipboard operation
llama.cpp copied to clipboard

Combine large LLM with small LLM for faster inference

Open ggerganov opened this issue 2 years ago • 36 comments

So I was thinking about the following idea. It is probably completely bogus, but I would definitely investigate it when and if I had the time to, so maybe someone else would be interested as well.


Large LLM takes a lot of time to perform token inference. Lets say it takes 500ms per token.

A small LLM (or some other approach) can infer a token very fast. Lets say < 5ms.

Lets assume that the small LLM is correct 80-90% of the time.

The idea is the following:

  • Before I run the large LLM inference for the next token, I infer it using the small LLM
  • I now want to somehow partially evaluate the large LLM (let's say the first 10% of the layers) and get an approximate estimate for the next token
  • If this estimate indicates a high probability for that token (i.e. above some threshold) - we stop and directly say that this is the new token. At this point we would have consumed (5ms for the small LLM + ~50ms for the large LLM)
  • Otherwise, we proceed to evaluate the rest of the layers of the large LLM

In the described process, I would reach step 4 only for 10-20% of the tokens, but for the rest - I will take the shortcut in step 3. Hence, I will have an efficient inference with a Large LLM.

Obviously, the biggest question is if step 2 is possible at all. I suppose the answer is "no", but who knows.

ggerganov avatar Mar 30 '23 17:03 ggerganov

One idea, which would require adding extra model parameters and doing some training:

Because llama (and other transformer language models) are residual networks, i.e. each block adds to the output of the previous instead of replacing it. This can be thought of each block refining the model's current understanding of what token will occur next.

Maybe what might be possible is to add a linear layer after each block, one that produces its own estimate for the token distribution. The last block already has this; it is the output of the network.

I'm thinking if each block can estimate the token probability, maybe you could compute it at each block, and stop early if it's converged. That is, if the subsequent blocks are not appreciably changing the token probabilities we can just stop early.

I did an experiment where I limited llama.cpp to only use the first N blocks of the network. it is surprisingly coherent at 27/32 blocks.

Deepmind has a paper called "pondernet" that has a similar idea, but I believe doing this would require more changes to the model.

blackle avatar Mar 30 '23 21:03 blackle

If this estimate indicates a high probability for that token (i.e. above some threshold) - we stop and directly say that this is the new token. At this point we would have consumed (5ms for the small LLM + ~50ms for the large LLM)

If this approach worked reliably, couldn't you just run only the first few layers from a big model and just skip the rest?

I actually played with this kind of thing a while, randomly disabling chunks of layers, only evaluating every other layer, stuff like that. Unfortunately, it seemed to make a pretty big difference in quality and also doing something like only evaluating the first 10 layers of a 30 layer model produced pretty incoherent output.

It may depend on the model and which layers are skipped. Unscientifically, it seemed like the first layers were the most important.

KerfuffleV2 avatar Mar 30 '23 21:03 KerfuffleV2

@ggerganov Relevant paper https://arxiv.org/pdf/2303.08112.pdf image

Andrey36652 avatar Mar 31 '23 00:03 Andrey36652

@ggerganov I think point 2 is possible, it sounds similar to Confident Adaptive Language Modeling (CALM) from Google AI. Instead of a smaller pre-trained LLM they trained an early-exit classifier on intermediate layers which allowed them achieve roughly 50% reduction in inference time.

CALM

abetlen avatar Mar 31 '23 01:03 abetlen

I think that might like the full output using huge layers be interrupted after the first some layers,

Like if output 65b is reasonable,

Output65b * layers7b have/ layers s65b have

May not creat a reasonable result, because the 65b output need the rest layers to be fully adjusted.

FNsi avatar Mar 31 '23 05:03 FNsi

I had the following idea: You could determine in which layer the tensor results are moving away or converging by comparing each layer result from the 7B and e.g. the 65B model with the dot_product_similarity. In this way, one could at least roughly determine which layers of the small model and which layers of the large model are leading to a certain result.

And if I understand correctly, you don't need to load all layers for such experiments. You can use the mapped address of the mmap function to load and unload the layers you want to determine. This would also mean that a large model can be used with a small amount of RAM. Just load the layers in chunks, save the result in scratchpad and using the result as input to the next chunks of layers. Loading and unloading the chunks would of course affect performance.

I'm also not sure if this would work, as my understanding of the ggml library is that the calculations are done lazily instead of eagerly. Is this correct?

PriNova avatar Apr 01 '23 12:04 PriNova

Hello I had the same idea: https://github.com/facebookresearch/llama/issues/256 I'm using cerebras model. So I use the 111M model to guess the next 2-3 tokens. Then I check them in parralell batches with the 2.7B model. However many tokens are checked correct it can add to the list. This method means the more GPUs you have the faster you can predict. Not limited by the speed of your GPUs. I have tested it and it works. I think it gives a speed increase but I haven't timed it yet.

There are many ways to modify this algorithm including using the small LMM to predict also the second most likely word and so on.

elephantpanda avatar Apr 05 '23 03:04 elephantpanda

This idea of using a small model to predict multiple steps ahead and batch the scoring of generated sequence is called Speculative Sampling.

Here's a relevant paper: https://arxiv.org/abs/2302.01318

It's a completely different method than CALM that tries to train a model to output a prediction of the next token at each layer of the LLM and exit early if the confidence is high enough.

Both methods could be implemented separately and might even work together.

I think it's worth it to create another issue for optimization of inference through early exit (CALM-style).

etiennebalit avatar Apr 06 '23 10:04 etiennebalit

BTW I tried Speculative Sampling, and while the small model guessed correct about 70% of the time so it could jump ahead 2 tokens at a time I think it didn't quite compensate the extra time needed to infer a double batch. But the proof of concept worked and it would definitely give a speed boost of about 60% if you had enough GPUs.

(FUnny I thought I invented something new but that paper came out a month ago!)

elephantpanda avatar Apr 06 '23 12:04 elephantpanda

OK. I thought of a new idea. Called "chaining". Which is a modification of Speculative Sampling but chaining more models together.

You take models of size 1B, 4B, 16B Then with the 1B model you generate one token Next with the 4B model you check the previous token giving you two tokens Next with the 16B model you check the previous token giving you three tokens.

The idea is that if a model half the size runs twice as quick you can chain together these models because 1+1/2+1/4+1/8+....=2

You could play around with the numbers of models, their sizes, how many steps each one does etc. to get the best speed. Even better you could use a genetic algorithm to get the best set up.

These speculative sampling techniques will work best with low temperatures.

elephantpanda avatar Apr 06 '23 16:04 elephantpanda

I've been considering something similar but with other non LLM networks using the same technique, so thank you for posting. I'm happy to have read the other comments regarding papers which extend the concepts.

"""If this estimate indicates a high probability for that token (i.e. above some threshold) - we stop and directly say that this is the new token. At this point we would have consumed (5ms for the small LLM + ~50ms for the large LLM)"""

In this case would if you wanted to continue, then would you insert that chosen token as an input token to allow the large model to continue with proper state of logits? Would this also require more static topk/rand values, and verification of best values for temp across each model (the small, and large)? I'm guessing the newer API for setting/getting states can be used to rewind the 10% and reinsert the token you've chosen with comparison to the smaller LLM.

mikeggh avatar Apr 22 '23 16:04 mikeggh

IMO The paper https://arxiv.org/abs/2302.01318 overcomplicates things by calculating conditional probabilities etc. My method is much simpler. You use the small model to generate 2-3 tokens using the greedy algorithm, choosing the most likely token each time. Then using the large model you large model, to simultaneously test the strings with 0,1,2,3,.. tokens added on and predict the next token. If the predicted tokens match the small model's greedy algorithm you can use these tokens otherwise you can use the token generated by the first token in the batch.

I can't see any advantage in calculating all these conditional probabilities. Unless someone can prove me wrong?

elephantpanda avatar Apr 22 '23 19:04 elephantpanda

As mentioned by @etiennebalit, this idea is related to speculative sampling (and what pauldog suggested at the end here is exactly that I believe). Speculative sampling was recently landed in HF transformers (they called it assisted generation), so that might be useful as a reference: https://github.com/huggingface/transformers/pull/22211

Their implementation uses a heuristic to dynamically figure out how many tokens to generate at a time.

aljungberg avatar Apr 23 '23 22:04 aljungberg

So to recap, essentially the way this works is that we generate a sequence using the smaller model, and then evaluate it as a batch with the larger model. Then we check the logits token by token, and if we would have generated the same token with the larger model we know that we can keep it, otherwise we have to discard the rest of the sequence and continue from that point.

I see a few problems with this:

  1. It assumes that processing in batches is much faster. This is the source of all the performance gains.
  2. It assumes that we have a much smaller model that uses the same tokenizer, which is not really the case with LLaMa as far a I am aware, 7B is still too big for this
  3. The failure rate with sampling strategies other than greedy may be too high

Using BLAS may not be an option because the batches would have to be too big. Currently we only use BLAS when the batch size is at least 32, and with such a large batch a failure in one of the first tokens would be very costly.

slaren avatar Apr 23 '23 23:04 slaren

So to recap, essentially the way this works is that we generate a sequence using the smaller model, and then evaluate it as a batch with the larger model. Then we check the logits token by token, and if we would have generated the same token with the larger model we know that we can keep it, otherwise we have to discard the rest of the sequence and continue from that point.

I see a few problems with this:

  1. It assumes that processing in batches is much faster. This is the source of all the performance gains.
  2. It assumes that we have a much smaller model that uses the same tokenizer, which is not really the case with LLaMa as far a I am aware, 7B is still too big for this
  3. The failure rate with sampling strategies other than greedy may be too high

Using BLAS may not be an option because the batches would have to be too big. Currently we only use BLAS when the batch size is at least 32, and with such a large batch a failure in one of the first tokens would be very costly.

For most commercial GPUs it won't be faster. This is for situations where you have a rack of GPUs and you can max them out to get past the "theoretical limit". i.e. GPUs can only run so fast, but with this trick you can get past this speed limit. e.g. you could put one batch on each GPU. With this method, "theoretically" you could run a language model arbitrarily fast. But in practice you would need too many GPUs. As for a smaller model for Llama, IDK, what you can do here. Perhaps train a really simple model yourself. Or, use a small model with a different tokenizer and instead of comparing tokens, you can compare the text output. Although you're probably more likely to get matches on a model trained on similar data. Actually the success rate for the small models can be quite good as long as the temperature is not set too high.

elephantpanda avatar Apr 23 '23 23:04 elephantpanda

To expedite the process, we can bypass evaluating the larger model if the smaller one is sufficiently confident. This can be identified by the decoder's low entropy distribution and high-valued logits. This could handle the most grammar-controlled part of sentences.

Piezoid avatar Apr 24 '23 00:04 Piezoid

To expedite the process, we can bypass evaluating the larger model if the smaller one is sufficiently confident. This can be identified by the decoder's low entropy distribution and high-valued logits. This could handle the most grammar-controlled part of sentences.

You could do this but it will give different results in some cases. In other words, that method doesn't just speed things up, but is really a different model. Therefor the model will be slightly worse. (Maybe not great deal)

elephantpanda avatar Apr 24 '23 01:04 elephantpanda

Transformers became the leading model because it's highly parallel. During training we can get the logits for the whole sequence in parallel: you can efficiently train it on more tokens/s than any other model (today).

Speculative generation basically seeks the same advantage by making inference more like training, where you already "know" the tokens you want to get the logits for.

It assumes that processing in batches is much faster. This is the source of all the performance gains.

Right. Although I wouldn't use the word "batches". Maybe I'm just arguing semantics here but in the transformer context, batching usually refers to checking multiple different sequences in parallel. So to be really clear, that's not a requirement for speculative sampling because with a batch size of 1 you can still get the parallelism time savings of checking multiple tokens in a single sequence at once, if you know what to put in that sequence. (And if you have lots of compute to spare you could combine that with batching on top and check multiple predicted sequences in parallel.)

It assumes that we have a much smaller model that uses the same tokenizer, which is not really the case with LLaMa as far a I am aware, 7B is still too big for this.

You can make your own! You could use 2 bit quantization on the 7B model for example, or even lower. The only thing that matters is that your assistant model gets it right often enough to pay for itself.

For most commercial GPUs it won't be faster. This is for situations where you have a rack of GPUs and you can max them out to get past the "theoretical limit". i.e. GPUs can only run so fast, but with this trick you can get past this speed limit. e.g. you could put one batch on each GPU.

You're right: if you are already utilising 100% of your FLOPs, speculative sampling won't help you. There's gotta be enough underutilised compute you can fill up with parallel token evaluation to make up for the cost of running the assistant.

With this method, "theoretically" you could run a language model arbitrarily fast. But in practice you would need too many GPUs.

Okay this is more speculative but a fun question to consider. What if we did have infinite GPUs, could we get infinite speed? I think unfortunately not. Even in a best case scenario where the assistant model is 100% correct in predicting the output of the full model, you still have to wait for it to predict X tokens, one at a time. Past some point adding more GPUs won't make the assistant model go any faster because the kernel launch time and data transfer delays overtake the benefit of adding more parallelisation to the evaluation.

And the assistant model can't generate the second token until it's generated the first token. You could have another layer of speculation on top with an even smaller model, but at some point you won't be able to find a smaller model that is correct sufficiently often. (Given a truly infinite budget I guess you could just try every possible sequence in parallel but then you don't need an assistant model at all, just brute force all possible continuations on the big model, you've got inf FLOPs!)

So roughly with speculative sampling you'll approach small model single token generation time * average number of tokens correctly predicted + big model time to evaluate the completed sequence

In theory the big model evaluates all the tokens in a sequence "at once", but as you rightly point out, that's only true for a machine with lots of multiprocessing spare capacity. In the context of running on CPUs like with llama.cpp, you'd need some bizarre number of CPU cores for this to hold true even for the 7B model. Every extra token in the sequence adds uhm, back of the napkin calculation here, at least 3 * 4096 * 4096 * seqlen half precision FMA ops just for the q, k and v linear projections, and then we get an O(n**2) attention score calculation on top of that!

Bottom line is that I don't think it's helpful for CPU eval unless you do what @Piezoid said and say "okay that generation is 'good enough' for some definition of good, I'm going to skip the big model run". But even that has limited opportunity because if you're not confident about the whole generation you're eventually going to consult with the big model and due to the autoregressive nature of transformers it's going to have to evaluate the tokens you "skipped" then anyhow.

I'm not a contributor here, just interested in the project, so someone else is probably in a better position to answer whether llama.cpp is able to use much of the theoretical FLOPs of the CPU. Only if there's a bunch of spare capacity being left on the table due to sequential bottlenecks speculative sampling would be worth a look, I think.

aljungberg avatar Apr 24 '23 11:04 aljungberg

Well the point is if you have enough GPUs, you could check every combination of 1 to 20 tokens (there are about maybe 10^90 such sequences!) . Run each one on a different GPU. Then from this data you would predict 20 tokens at once. So "theoretically" with 10^90 GPUs, you could predict 20 tokens a time in the same time you could predict 1. This just shows that theoretically you can make a LLM run as fast as you like. But practically this is another matter. However using this silly method the number of GPUs you need wouldn't even fit in the observable Universe. With using a smaller model you can reduce the amount of GPUs you need by guessing only the most likely sequences instead of all 10^90 with the expense of sometimes missing the actual sequence.

I have some doubts whether a 2bit Llama model would output anything useful or just garbage. But there is probably someway to truncate the llama model somehow.

elephantpanda avatar Apr 24 '23 17:04 elephantpanda

Bottom line is that I don't think it's helpful for CPU eval unless you do what @Piezoid said and say "okay that generation is 'good enough' for some definition of good, I'm going to skip the big model run". But even that has limited opportunity because if you're not confident about the whole generation you're eventually going to consult with the big model and due to the autoregressive nature of transformers it's going to have to evaluate the tokens you "skipped" then anyhow.

So technically you'd only skip the topk/rand part. Rewinding the logits to insert the lower models token (if you even limit at lower layers) might cost too much cpu time considering the 200-500meg copy (state). Even if you attempted to optimize by using intel instructions for SHA in quarter regions of the state, or modified the ggml to keep track of changes during the initial X layers.. I'm not sure if it would really be worth it in the end.

mikeggh avatar Apr 25 '23 01:04 mikeggh

Can you guys make a cache layer for the most frequently words? I guess that would be fast in many ways.

FNsi avatar Apr 25 '23 01:04 FNsi

Can you guys make a cache layer for the most frequently words? I guess that would be fast in many ways.

You mean given input X, always output token Y? You can stick a hash map in there containing the most frequent starting contexts and the greedy output, but the value would rapidly diminish as the context size grows beyond a few tokens. Soon enough you have unfathomable numbers of possible inputs.

Maybe it's possible to find a few situations where basically nothing in the context matters except the last few tokens in terms of picking the next one, e.g maybe [any text at all]<eos>Roses are is always followed by red no matter the value for [any text at all]. But one imagines that would be rather rare and indeed the cases where the context doesn't matter are the least useful. We use LLMs precisely because they can generalise and do something clever even with contexts never seen before, and in practice nearly all the time the context you provide is the first and last time it'll ever be seen by any LLM assuming you've chatted for more than a few messages with the bot.

aljungberg avatar Apr 25 '23 09:04 aljungberg

Well the point is if you have enough GPUs, you could check every combination of 1 to 20 tokens (there are about maybe 10^90 such sequences!) .

Yep that's what I meant by brute-forcing it. And if you have such vast resources you can just run lots of independent instances of the full size model, no need for any tricks like speculative sampling. Like you could get an LLM to generate 2048 tokens in a single go. Even then, that's not infinite speed! As you point out, at the limit you won't get it to be faster than generating 1 token. The layers still have to be processed sequentially even if the sequence itself is processed in parallel on your infinite GPU array.

I have some doubts whether a 2bit Llama model would output anything useful or just garbage. But there is probably someway to truncate the llama model somehow.

It's been tried! Results not great so far, yet not sure I'd bet against it given the rapid improvement on 3 and 4 bit quantisation. Of course even if it's roughly in the right ballpark eventually, such a model might still be useless as an assistant model. Need pretty good prediction accuracy to be worth it.

aljungberg avatar Apr 25 '23 09:04 aljungberg

Well... Later today.. I plan on saving the state of general starting context/prompts and loading that to speed up using more than what I have on my current test platform, It is just a hack so I don't have to wait to load the entire 'prompt' or 'roles' if I attempt them. I think the save/load state will work great for that but beyond static starting points then we're back to other strategies like this would have been..

mikeggh avatar Apr 25 '23 15:04 mikeggh

@mikeggh I put up an implementation of saving/restoring initial prompt state in #1169

ejones avatar Apr 25 '23 16:04 ejones

@ejones Great job :> thank you sir!

mikeggh avatar Apr 25 '23 20:04 mikeggh

I think the method is very well described here: https://huggingface.co/blog/assisted-generation, and there are some benchmarks with real-life gaining on different GPUs and tasks. TL;DR: it's helpful most times even when the small model runs in greedy mode and the large model doesn't.

pcuenca avatar May 13 '23 16:05 pcuenca

I've found a new way to do this which doesn't involve batching so is very fast.

  1. Let your current tokens be: ABCD
  2. First use the tiny model to predict the next 3 tokens using the greedy algorithm: (First input ABCD to get new token E, then input ABCDE to get new token F, then input ABCDEF to get new token G). Assuming the tiny model is tiny enough this will be very fast.
  3. Next input the sequence ABCDEFG into the larger model.
  4. This gives output of 7 probability vectors.
  5. Look at the 4th probability vector and use it to produce a token. If that token is E then goto step 6 otherwise use add token and stop.
  6. Look at the 5th probability vector and use it to produce a token. If that token is F then goto step 7 otherwise add this token and stop.
  7. Look at the 6th probability vector and use it to produce a token. If that token is G then goto step 8 otherwise add this token and stop.
  8. Look at the 7th probability vector and use it produce the final token.

This way you can predict up to 4 tokens at a time.

This assumes that the probability vector at position N is not much affected by the tokens after N+1. I don't know how true this is but it seems to work quite well for low temperature text. I have tested it with Cerbras models and get about 25% speed up and up to 2-3x speed up for very predictable text. One could also keep track of the probabilities of the tokens given by the tiny model and stop prematurely if the probabilities get too low as to become useless.

elephantpanda avatar May 22 '23 03:05 elephantpanda

I've found a new way to do this which doesn't involve batching so is very fast.

  1. Let your current tokens be: ABCD
  2. First use the tiny model to predict the next 3 tokens using the greedy algorithm: (First input ABCD to get new token E, then input ABCDE to get new token F, then input ABCDEF to get new token G). Assuming the tiny model is tiny enough this will be very fast.
  3. Next input the sequence ABCDEFG into the larger model.
  4. This gives outputs probability 7 vectors.
  5. Look at the 5th probability vector and use it to produce a token. If that token is E then goto step 6 otherwise use add token and stop.
  6. Look at the 6th probability vector and use it to produce a token. If that token is F then goto step 7 otherwise add this token and stop.
  7. Look at the 7th probability vector and use it to produce a token. If that token is G then goto step 8 otherwise add this token and stop.

This way you can predict up to 3 tokens at a time.

This assumes that the probability vector at position N is not much affected by the tokens after N+1. I don't know how true this is but it seems to work quite well.

We should add a PoC / example of this. Can use 117M GPT-2 as a helper LLM:

https://github.com/ggerganov/ggml/tree/master/examples/gpt-2

But at ~5ms / token it is still quite inefficient. We can use something even faster than that - not necessarily an LLM. Any suggestions?

Probably using @xaedes's baby LLaMAs would be highly effective: https://github.com/ggerganov/ggml/issues/8#issuecomment-1555206131 Plus the possibility to fine-tune, retrain these small models on custom, user-specific domain

Another alternative is using the TinyStories models

ggerganov avatar May 22 '23 12:05 ggerganov

As a proof of concept you could just generate some text using the greedy algorithm.

Then store this text.

Then run it again using your stored text as your "predictions" of the tiny model.

You should get back the same text but just faster. This will prove the algorithm works for a "perfect" fast predictor. Then provided this works, you can worry about actually finding a suitable tiny model later.

For a good tiny model, it would be best to have something with the same tokenization (otherwise you'd have to retokenize the text which slows things down). I don't know which ones would be compatible with Llama. Perhaps just using Llama but with a very small window might be suitable for some cases. Or one could build a very simple model and train it using Llama as input. Doesn't even have to be a transformer model.

This concept would work well with a prompt like "The alphabet: A, B, C, D, " (which could run 10x faster as its very predictable) but not much improvement with a prompt like "Here is a list of 100 random words:"

elephantpanda avatar May 22 '23 13:05 elephantpanda