Cade Daniel

20 issues authored by Cade Daniel

Recently, we refactored the block manager subsystem to improve testability by separating concerns of each layer. See https://github.com/vllm-project/vllm/pull/3492 for more information. The V2 implementation does not have support for CPU-GPU...

Labels: good first issue, misc

Recently, we refactored the block manager subsystem to improve testability by separating concerns of each layer. See https://github.com/vllm-project/vllm/pull/3492 for more information. The V2 implementation does not yet have sliding window...

Labels: good first issue, misc

This PR allows vLLM to return correct log-probabilities of sampled tokens when speculative decoding is enabled. In addition, if the user specifies `logprobs` in their request, the correct top-k logprobs...
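To make the "top-k logprobs" behavior concrete, here is a minimal pure-Python sketch of computing the top-k log-probabilities for one sampling step from raw logits. This is a toy illustration, not vLLM's implementation; the function name `topk_logprobs` is hypothetical.

```python
import math

def topk_logprobs(logits, k):
    """Return the top-k (token_id, logprob) pairs for one sampling step.

    Log-softmax uses the max-subtraction trick for numerical stability.
    Toy sketch only; not vLLM's actual implementation.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - log_z for x in logits]
    # Rank token ids by log-probability, highest first.
    ranked = sorted(enumerate(logprobs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Example over a 4-token vocabulary.
top2 = topk_logprobs([2.0, 1.0, 0.5, -1.0], k=2)
```

When a request asks for `logprobs`, the engine must report these values for the sampled token and its top-k competitors at every position; with speculative decoding the reported values must come from the verifier's distribution, which is what the PR fixes.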

### Anything you want to discuss about vllm

We recently refactored the block allocation and management subsystem to improve its testability (PR https://github.com/vllm-project/vllm/pull/3492). We can replace the old implementation once...

Labels: misc

### Proposal to improve performance

We've recently rewritten the block management subsystem for better testability. We need to profile it under real load to make sure it is performant enough...

Labels: performance

### Proposal to improve performance

With the end-to-end correctness tests merged in https://github.com/vllm-project/vllm/pull/3951, we will now optimize the implementation to get a ~50% speedup on a 70B model.

### Work required: P0/P1...

Labels: help wanted, performance, speculative-decoding

### Proposal to improve performance

In https://github.com/vllm-project/vllm/pull/3951 we disabled bonus tokens (the token sampled from the verifier model assuming all proposal tokens are accepted) because its KV is not generated for the...

Labels: performance

## Overview

Speculative decoding allows a speedup for memory-bound LLMs by using a fast proposal method to propose tokens that are verified in a single forward pass by the larger...

Labels: help wanted, performance, speculative-decoding
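The propose-then-verify loop above can be sketched in a few lines. This toy version uses greedy matching: a proposal token is kept only if the verifier would have emitted the same token, and when all k proposals survive, the verifier's extra token comes for free. vLLM's actual scheme uses rejection sampling over full distributions; the function name `verify` is hypothetical.

```python
def verify(proposal, verifier_tokens):
    """Greedy speculative-decoding acceptance (toy sketch).

    proposal: k tokens from the fast draft model.
    verifier_tokens: k+1 tokens the large model emits at those positions
    (the extra one is the "bonus" token used when everything is accepted).
    Returns the tokens actually appended to the sequence this step.
    """
    accepted = []
    for p, v in zip(proposal, verifier_tokens):
        if p == v:
            accepted.append(p)   # proposal matches: keep it, keep going
        else:
            accepted.append(v)   # mismatch: emit verifier token and stop
            return accepted
    # All k proposals accepted: append the verifier's bonus token too.
    accepted.append(verifier_tokens[len(proposal)])
    return accepted

# Full acceptance yields k+1 tokens from one verifier forward pass.
out_full = verify([5, 7, 9], [5, 7, 9, 11])
# An early mismatch truncates at the first disagreement.
out_partial = verify([5, 8], [5, 7, 9])
```

The speedup comes from the verifier scoring all k positions in a single forward pass instead of k sequential decode steps.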

This PR allows tensor-parallel-size greater than 1 in vLLM's speculative decoding. It achieves this by broadcasting control flow information at the beginning of every invocation.
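The pattern is roughly: the driver rank decides the control flow for the step (e.g. which phase is running and over how many sequences) and broadcasts that metadata to all tensor-parallel workers before the model call, so every rank takes the same branch. A pure-Python simulation of the idea, with all names hypothetical (the real code uses torch.distributed collectives):

```python
import pickle

class FakeChannel:
    """Stands in for a collective broadcast among tensor-parallel ranks."""
    def __init__(self):
        self.buf = None

    def broadcast(self, obj):          # called by the driver (rank 0)
        self.buf = pickle.dumps(obj)   # serialize control-flow metadata

    def receive(self):                 # called by every worker rank
        return pickle.loads(self.buf)

def driver_step(channel, step_metadata):
    # Rank 0 decides the control flow and shares it before the forward pass.
    channel.broadcast(step_metadata)

def worker_step(channel):
    # Non-driver ranks learn what to do this step from the broadcast.
    meta = channel.receive()
    return f"running {meta['phase']} pass on {meta['num_seqs']} sequences"

chan = FakeChannel()
driver_step(chan, {"phase": "verification", "num_seqs": 4})
msg = worker_step(chan)
```

Paying one small broadcast per invocation keeps all ranks in lockstep without requiring each worker to independently rerun the scheduling logic.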

With Mixtral now rivaling some of the proprietary models, it would be nice to have a comparison of OSS Mixtral against ChatGPT, Claude, and Gemini.