[RFC] Drop beam search support
Motivation.
TL;DR: To reduce system complexity and enable future optimizations, we propose discontinuing beam search support.
Currently, vLLM supports 3 types of sampling: greedy, random, and beam search. Beam search, which dynamically creates and removes top-k branches at each step, is the most complex of the three. Traditionally, beam search has been popular for NLP tasks like translation and summarization. However, in the LLM era, beam search has become less common. Major LLM APIs such as GPT, Gemini, and Claude do not support it.
In vLLM, beam search initially motivated the idea of PagedAttention. In fact, vLLM excels at beam search compared to other inference engines, since PagedAttention can efficiently handle the dynamic nature of beam search and minimize its KV cache usage. Despite this, supporting beam search introduces significant system complexity and hinders potential optimizations: it complicates the system while being rarely used.
To resolve this, we propose eliminating beam search support, which will provide the following benefits:
- Reduced Complexity in Sampling and Output Processing
  - The current code for sampling and output processing is complex partly because beam search is considered. Without beam search, vLLM will only need to support greedy or random sampling, and this will greatly simplify the code.
- More Predictable Block Table
  - Beam search causes the block table for PagedAttention to change dynamically at each step. This leads to synchronization between the model runner and the scheduler. Removing beam search will be the first step to allow them to operate asynchronously.
- Potential Future Removal of SequenceGroup
  - SequenceGroup is used when a request maps to multiple output sequences via parallel sampling or beam search. It helps manage memory sharing and enforce gang-scheduling of the sequences. Without beam search, we can potentially eliminate SequenceGroup, as parallel sampling does not require gang-scheduling, and memory sharing can be managed by prefix caching.
Proposed Change.
We plan to execute this in 3 steps:
- Remove beam search and its directly related code in the sampler and output processor.
- Simplify the code further, leveraging the fact that vLLM will only support greedy or random sampling.
- Enable the future optimizations described above.
We are open to reintroducing beam search if there is strong demand from the community. Please share any concerns regarding this decision. We apologize for any inconvenience caused by this change.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
We should disable beam search superficially in the next vLLM release (e.g. assert False, "beam search is deprecated, see https://github.com/vllm-project/vllm/issues/6226") and see the reaction. If there is a lot of noise then we should consider taking a path that maintains compatibility.
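For illustration, a superficial gate like the one suggested could be as small as the following sketch; the function name, its placement, and the exception type are hypothetical, not the actual patch:

```python
# Hypothetical validation hook; placement and exception type are illustrative only.
def _reject_beam_search(sampling_params) -> None:
    if getattr(sampling_params, "use_beam_search", False):
        raise ValueError(
            "beam search is deprecated, see "
            "https://github.com/vllm-project/vllm/issues/6226"
        )
```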
Beam Search gives consistent results and is used in Production level systems where predictable results are important. So dropping beam search would be a bad idea IMHO. Setting temperature=0 provides some predictable results but not always.
MLPerf inference benchmark requires the beam search feature on so I think this is still useful in the industry. Here's the link to the MLPerf inference rules: https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#413-additional-inference-parameters
thanks, -yuan
Regarding MLPerf Inference @zhouyuan , it is only needed for the GPT-J benchmark (which was the first LLM task they added) and is not used for Llama 2 70B or Mixtral 8x7B (which are more recent). I don't believe beam search will be used in future tasks since it is generally not practical for cost-effective deployment.
As an alternative that still serves customers who want similar functionality, I would like to propose a new parameter for the current vLLM system; let's call it num_q (number of queries, formerly num_beams). With num_q=5, for example, instead of behaving like best_of or n, it would take the top 5 candidates for the first generated token and extend each of them up to max_tokens.
Request:
{"prompt": "recommend me the best city to visit in this world: ", "num_q": 5, "max_tokens": 100}
Response:
{"result": [
{ "output": "Paris", log_probs: -0.123123},
{ "output": "Amsterdam", log_probs: -0.123123},
{ "output": "Beijing", log_probs: -0.123123},
{ "output": "Dubai", log_probs: -0.123123},
{ "output": "Bogota", log_probs: -0.123123},
]}
Customers would be guaranteed to get 5 different responses along with their logprobs. They could still conduct beam search themselves by picking the sequence with the best log_probs, but this introduces far less complication into vLLM to achieve something similar. It also gives users more freedom to decide which sequence they want to use.
- num_q => num_beams
- max_tokens => beam_width
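To make the proposal concrete, here is a rough sketch (under stated assumptions, not an endorsed design) of how the suggested num_q behaviour could be emulated with today's offline vLLM API: take the top-num_q candidates for the first generated token, then greedily extend each one. The model name is a placeholder, and appending the decoded token to the prompt only approximates forcing that token.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="my-model")  # placeholder


def num_q_generate(prompt: str, num_q: int = 5, max_tokens: int = 100):
    tokenizer = llm.get_tokenizer()
    # Step 1: one greedy step, keeping the top-num_q candidates for the first token.
    first = llm.generate(
        [prompt], SamplingParams(max_tokens=1, logprobs=num_q, temperature=0.0)
    )[0]
    top_first = first.outputs[0].logprobs[0]  # dict: token_id -> Logprob

    results = []
    for token_id, lp in list(top_first.items())[:num_q]:
        piece = tokenizer.decode([token_id])
        # Step 2: greedily extend each candidate up to the length budget.
        cont = llm.generate(
            [prompt + piece],
            SamplingParams(max_tokens=max_tokens - 1, temperature=0.0),
        )[0].outputs[0]
        results.append(
            {"output": piece + cont.text, "log_probs": lp.logprob + cont.cumulative_logprob}
        )
    return results
```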
@cadedaniel Thanks for the suggestion!
Here's what we've decided to do:
- We'll add a deprecation warning for beam search (#6402) and plan to release a new version next week.
- After the release, we'll gather user feedback and usage data (#6404) for 2-3 weeks.
- In the meantime, we'll work on a separate branch to remove beam search and implement code simplification and optimizations.
- For the v0.6.0 release, unless we receive strong pushback, we'll merge the changes from the branch developed in step 3.
Beam Search gives consistent results and is used in Production level systems where predictable results are important. So dropping beam search would be a bad idea IMHO. Setting temperature=0 provides some predictable results but not always.
+1, our teams observe benefits for reliability and occasionally even latency from beam search, highly relevant in Prod
Major LLM APIs such as GPT, Gemini, and Claude do not support it.
Yes. The most commonly used now is top-p and top-k sampling.
I kindly suggest maintaining beam search support, as it is the primary option for translation tasks, even with LLMs.
@nightflight-dk Thanks for your input! Are you using vLLM in production? If so, we'd be happy to discuss our plan with you.
A potential use-case we have is that sometimes using guidance/outlines/lm-format-enforcer can end up "forcing" the model down a path it doesn't really want to take. For example, if we ask the model to extract the color from 'Navy blue T-shirt' and we restrict the output to Spanish (e.g. 'Azul', 'Naranja'), smaller models will output Naranja, because the model is aiming to output Navy blue (so the first token is Na, after which we force the model to output Naranja). With beam search we can let the model "look ahead" across tokens. We planned to experiment with beam search to see whether that would help in cases like these.
Adding the fact that the model should choose from Azul and Naranja to the prompt doesn't work well enough for smaller models, they still want to output Navy blue.
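For reference, the kind of constrained request described above looks roughly like this against a vLLM OpenAI-compatible server, using the documented guided_choice extra parameter; the model name and URL are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="my-small-model",  # placeholder
    messages=[{"role": "user",
               "content": "Extract the color from: 'Navy blue T-shirt'. Answer in Spanish."}],
    extra_body={"guided_choice": ["Azul", "Naranja"]},
)
# With greedy decoding, the model's preferred first token "Na" (from "Navy") forces
# "Naranja"; beam search would let the lower-ranked path that reaches "Azul" survive.
print(resp.choices[0].message.content)
```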
I think the typical use case for taking multiple samples is when you have a method for "trying" a sample. Perhaps the first sample "fails", and then you want to try the second sample, etc. (Our specific use case is formal proof search.)
Beam search is well suited for this application, because the beams provide diversity. With random sampling I could end up retrying the same "almost surely good" idea over and over, instead of continuing to the second idea. It's true that beams ranking lower are likely bad. But trying a bad idea still beats trying the same good idea twice.
That said, I'm a fan of simpler code. If random sampling is much faster than beam search, we can just deduplicate the samples or something. I will run some experiments to measure how this will affect us.
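As a point of comparison, the "sample and deduplicate" fallback mentioned above might look like this with the offline vLLM API; the model name and sampling parameters are illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="my-prover-model")  # placeholder
params = SamplingParams(n=16, temperature=0.8, max_tokens=256)

outputs = llm.generate(["theorem foo : ..."], params)[0].outputs
# Keep one representative per distinct completion, highest cumulative logprob first.
unique = {}
for out in sorted(outputs, key=lambda o: o.cumulative_logprob, reverse=True):
    unique.setdefault(out.text, out)
candidates = list(unique.values())  # try these in order, like ranked beams
```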
We have noticed that token level logprobs from beam search are quite informational compared to those from nucleus sampling. A lot of our workflows depend on these logprobs and I'd suggest keeping beam search support as well!
At Heavy.ai we depend heavily on beam search in vLLM, in production with customers, to get optimal accuracy for text-to-SQL tasks (https://www.heavy.ai/heavyiq/overview), and we would lose significant accuracy with it turned off. Perhaps we could implement it ourselves using the log probabilities (we would be nervous about the performance, though) or freeze our version to 0.5.2, but neither is ideal at all.
We are also looking at various sampled approaches using a judge model to pick the best, and here again taking the top-n beam search generations provides better accuracy than setting a non-zero temperature and taking n samples.
From the above I understand the motives but I'd request that this be reconsidered. It's not just us either, pretty much all the SOTA text-to-SQL approaches use beam search to get best accuracy.
Dropping beam search is a deal breaker for our use case. We use it extensively in prod. We have found that it increases the accuracy of our LLM's responses by roughly 1%, which is absolutely critical for us. Unfortunately, if vLLM stops supporting beam search we'll have to switch to an unoptimized inference engine.
We are considering using beam search as it actually improves performance and we are reviewing its use at the production level.
This alone might make us reconsider using vLLM. The speed and complexity of implementation can be seen as a trade-off for better output quality and the ability to inspect the model's candidate paths. Furthermore, we might be able to overcome limitations with streaming operations.
Must we really delete it? We do not want that. If it must be deleted, we need another solution. How?
We, at Spotify, use vLLM beam search to sample multiple answers from the same prompt in multiple textual tasks. This breaking change would hurt us significantly and we may have to reconsider vllm usage for some of our use cases, if there are no alternatives :( please, reconsider it
Feel free to DM me
We are very much relying on beam search for our biomedical industry applications to significantly boost performance in our setting. That benefit is large enough to consider alternative projects for serving, but we would hate to have to abandon vllm :(
We are using beam search in production and would appreciate its continued support
For production use cases, please also indicate why you chose beam search rather than the other sampling methods. Many public API services do not provide beam search; what would you do if you didn't have it (i.e. is there any workaround)?
A possible workaround: LLMs are quite capable these days; if you just want output diversity, how about adding a system prompt instructing the model to produce more diverse output?
As a user of guidance/AICI/other methods of constraining LLM output, disabling beam search can reduce quality of outputs. For the reason users describe above.
We've noticed that across a wide array of models, these two facts interact poorly:
- LLM tokens may contain more than one lexical JSON token, e.g. the token `]}`, and models prefer these, as they are more efficient to generate than `]` and `}` independently - they are "over-weighted" in sampling.
- Once the model emits a token closing a string, array, or object, it cannot backtrack and correct itself if the schema requires - or it would be "sensible" - to admit another chunk of text, array element, or key.
For vLLM with open source models, beam search helps overcome this obstacle, in effect giving the model a weak form of backtracking.
With LLM APIs, we maintain a list of tokens which we add a small negative weight to, however this list is not exhaustive and of course, we need to derive the token IDs for each unique tokenizer.
In my experience, beam search works better than negative weighting these tokens, and is more straightforward and adaptable to multiple models.
This is a sample of our "verboten tokens" file:
[
"\",",
"],",
"[\"",
"[]",
",\"",
"\"]",
"][",
"},",
"\",\"",
"{{",
"\"\"",
"}}",
"{\"",
"]]",
@AaronFriel have you tried the guided decoding at https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api ?
I'm familiar with Guidance, yes, I mentioned it in my reply.
My understanding is that guided decoding in particular benefits from beam search, for reasons alluded to here and here, i.e. you can get into nasty situations with guided decoding where the probabilities of earlier tokens can be skewed by the probabilities of later tokens, even if some of those combinations are disallowed by the guided choice/regex/json.
We also use beam search in a production deployment of vllm and would probably have to migrate off of vllm without it. We're optimizing for accuracy on a structured task, not diversity of output, and have found that beam search produces the best results.
My HEAVY.AI colleagues have already commented, but to add a little detail ... we use beam search so that we can performantly constrain things to our specific SQL syntax. We've found it to be faster than alternatives and are using it in production across multiple accounts.
@hrsmanian @zhouyuan @lanking520 @nightflight-dk @HeegonJin @SemMulder @darabos @DhruvaBansal00 @tmostak @physicsrob @YooSungHyun @denadai2 @sjmielke @Reichenbachian @AaronFriel @hinnefe2 @mflaxman10
Due to strong pushback from the community, we have decided to reconsider this proposal. vLLM will continue to support beam search until further notice. We will try to find other ways to optimize the system overheads and get back to this proposal after more exploration. Thanks everyone for the feedback!
@WoosukKwon thank you. Super appreciated!
To give more context, Spotify, among other use cases, needs to have long exact inference (e.g. for recommendations). Thus, beam search is great for this :)
An update on this thread:
For users who need beam search, I'd like to know how sensitive you are w.r.t. the latency and throughput of the inference. Per my understanding, beam search is quite slow in terms of both latency and throughput. If you use beam search, I assume you are not very sensitive to the speed, but just want the quality of generation from beam search.
Why do I ask? Because I'd like to move the beam search logic one level higher, above the current vLLM. Say we have an inference engine that supports the OpenAI API server; it seems we can emulate an API server with beam search by asking the underlying server to produce one token at a time, with multiple logprobs:
def beam_search_proxy(sequence, beam_width, max_tokens):
    # `generate` asks the underlying server for the top `beam_width` next tokens
    # (one step at a time); `new_seq` appends a token and accumulates its logprob.
    candidates = [sequence]
    finished = []
    for _ in range(max_tokens):
        if not candidates:
            break
        new_candidates = []
        for seq in candidates:
            for token, logprob in generate(seq, max_tokens=1, logprobs=beam_width):
                new_candidates.append(new_seq(seq, token, logprob))
        # Move completed sequences aside; keep expanding the rest.
        finished += [x for x in new_candidates if x.is_finished()]
        new_candidates = [x for x in new_candidates if not x.is_finished()]
        new_candidates.sort(key=lambda x: x.cumulative_logprob, reverse=True)
        candidates = new_candidates[:beam_width]
    finished += candidates  # include beams cut off at the length limit
    finished.sort(key=lambda x: x.cumulative_logprob, reverse=True)
    return finished[:beam_width]
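For concreteness, here is one way the abstract generate()/new_seq() helpers above might be backed by an OpenAI-compatible completions endpoint. This is a hedged sketch, not a tested implementation: the Seq container, the EOS check, and the server URL/model name are assumptions, and the logprob fields follow the legacy completions format.

```python
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder
MODEL = "my-model"  # placeholder


@dataclass
class Seq:
    text: str
    cumulative_logprob: float = 0.0
    finished: bool = False

    def is_finished(self) -> bool:
        return self.finished


def generate(seq, max_tokens=1, logprobs=4):
    """Return (token, logprob) pairs for the top candidates of the next position."""
    resp = client.completions.create(
        model=MODEL,
        prompt=seq.text,
        max_tokens=max_tokens,
        logprobs=logprobs,
        temperature=0.0,
    )
    top = resp.choices[0].logprobs.top_logprobs[0]  # {token string: logprob}
    return list(top.items())


def new_seq(seq, token, logprob):
    # Crude end-of-sequence check; a real implementation would use finish_reason
    # or the tokenizer's EOS token.
    return Seq(
        text=seq.text + token,
        cumulative_logprob=seq.cumulative_logprob + logprob,
        finished=token in ("</s>", "<|eot_id|>"),
    )


# Example: beam_search_proxy(Seq(text="Once upon a time"), beam_width=4, max_tokens=32)
```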
the sharing of memory and computation among sequences, can be achieved via prefix caching.
disclaimer: I'm not familiar with beam search, and the semantic of the above function can be wrong. please just read the idea, to emulate beam search with a normal openai api server.
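For what it's worth, the prefix caching mentioned above can already be turned on; a minimal sketch (model name is a placeholder):

```python
from vllm import LLM

llm = LLM(model="my-model", enable_prefix_caching=True)
# Or for the OpenAI-compatible server:
#   python -m vllm.entrypoints.openai.api_server --model my-model --enable-prefix-caching
```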
If we can go in this direction, the outcome would be:
- vllm does not support beam search by itself
- but vllm will provide a beam search emulator to turn the openai api server into a server with beam search functionality
- what's more, this emulator is not specific to vllm. you can also use it to turn any openai api server into a server with beam search functionality
For users who need beam search, I'd like to know how sensitive you are w.r.t. the latency and throughput of the inference.
We are very sensitive to throughput, but not latency. We need the highest possible throughput with beam search. If there's a substantial drop in overall compute efficiency, or drop of beam search support, we would migrate our inference elsewhere (or possibly fork, although TBH we don't want to be in the business of optimizing inference.)
For what it's worth, I think it's unlikely that moving to a higher level abstraction would work without a substantial drop in throughput. My weak evidence for this: https://github.com/vllm-project/vllm/issues/1646
We currently monkeypatch our VLLM in production to make the fork operation performant. I honestly hate that we do this, but the cost implications of not doing it are unacceptable.