Cade Daniel

20 issues authored by Cade Daniel

Recently, we refactored the block manager subsystem to improve testability by separating concerns of each layer. See https://github.com/vllm-project/vllm/pull/3492 for more information. The V2 implementation does not have support for CPU-GPU...

Labels: good first issue, misc

Recently, we refactored the block manager subsystem to improve testability by separating concerns of each layer. See https://github.com/vllm-project/vllm/pull/3492 for more information. The V2 implementation does not yet have sliding window...

Labels: good first issue, misc

This PR allows vLLM to return correct log-probabilities of sampled tokens when speculative decoding is enabled. In addition, if the user specifies `logprobs` in their request, the correct top-k logprobs...
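To make the "top-k logprobs" behavior concrete, here is a minimal pure-Python sketch of computing the top-k log-probabilities for one sampling step from raw logits. This is a toy illustration, not vLLM's implementation; the function name `topk_logprobs` is hypothetical.

```python
import math

def topk_logprobs(logits, k):
    """Return the top-k (token_id, logprob) pairs for one sampling step.

    Log-softmax uses the max-subtraction trick for numerical stability.
    Toy sketch only; not vLLM's actual implementation.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - log_z for x in logits]
    # Rank token ids by log-probability, highest first.
    ranked = sorted(enumerate(logprobs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Example over a 4-token vocabulary.
top2 = topk_logprobs([2.0, 1.0, 0.5, -1.0], k=2)
```

When a request asks for `logprobs`, the engine must report these values for the sampled token and its top-k competitors at every position; with speculative decoding the reported values must come from the verifier's distribution, which is what the PR fixes.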

### Anything you want to discuss about vllm

We recently refactored the block allocation and management subsystem to improve its testability (PR https://github.com/vllm-project/vllm/pull/3492). We can replace the old implementation once...

Labels: misc

### Proposal to improve performance

We've recently rewritten the block management subsystem for better testability. We need to profile it under real load to make sure it is performant enough...

Labels: performance

### Proposal to improve performance

With the end-to-end correctness tests merged in https://github.com/vllm-project/vllm/pull/3951, we will now optimize the implementation to get a ~50% speedup on a 70B model.

### Work required: P0/P1...

Labels: help wanted, performance, speculative-decoding

### Proposal to improve performance

In https://github.com/vllm-project/vllm/pull/3951 we disabled bonus tokens (the token sampled from the verifier model assuming all proposal tokens are accepted) because its KV is not generated for the...

Labels: performance

## Overview

Speculative decoding allows a speedup for memory-bound LLMs by using a fast proposal method to propose tokens that are verified in a single forward pass by the larger...

Labels: help wanted, performance, speculative-decoding
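The propose-then-verify loop above can be sketched in a few lines. This toy version uses greedy matching: a proposal token is kept only if the verifier would have emitted the same token, and when all k proposals survive, the verifier's extra token comes for free. vLLM's actual scheme uses rejection sampling over full distributions; the function name `verify` is hypothetical.

```python
def verify(proposal, verifier_tokens):
    """Greedy speculative-decoding acceptance (toy sketch).

    proposal: k tokens from the fast draft model.
    verifier_tokens: k+1 tokens the large model emits at those positions
    (the extra one is the "bonus" token used when everything is accepted).
    Returns the tokens actually appended to the sequence this step.
    """
    accepted = []
    for p, v in zip(proposal, verifier_tokens):
        if p == v:
            accepted.append(p)   # proposal matches: keep it, keep going
        else:
            accepted.append(v)   # mismatch: emit verifier token and stop
            return accepted
    # All k proposals accepted: append the verifier's bonus token too.
    accepted.append(verifier_tokens[len(proposal)])
    return accepted

# Full acceptance yields k+1 tokens from one verifier forward pass.
out_full = verify([5, 7, 9], [5, 7, 9, 11])
# An early mismatch truncates at the first disagreement.
out_partial = verify([5, 8], [5, 7, 9])
```

The speedup comes from the verifier scoring all k positions in a single forward pass instead of k sequential decode steps.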

This PR allows tensor-parallel-size greater than 1 in vLLM's speculative decoding. It achieves this by broadcasting control flow information at the beginning of every invocation.
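The pattern is roughly: the driver rank decides the control flow for the step (e.g. which phase is running and over how many sequences) and broadcasts that metadata to all tensor-parallel workers before the model call, so every rank takes the same branch. A pure-Python simulation of the idea, with all names hypothetical (the real code uses torch.distributed collectives):

```python
import pickle

class FakeChannel:
    """Stands in for a collective broadcast among tensor-parallel ranks."""
    def __init__(self):
        self.buf = None

    def broadcast(self, obj):          # called by the driver (rank 0)
        self.buf = pickle.dumps(obj)   # serialize control-flow metadata

    def receive(self):                 # called by every worker rank
        return pickle.loads(self.buf)

def driver_step(channel, step_metadata):
    # Rank 0 decides the control flow and shares it before the forward pass.
    channel.broadcast(step_metadata)

def worker_step(channel):
    # Non-driver ranks learn what to do this step from the broadcast.
    meta = channel.receive()
    return f"running {meta['phase']} pass on {meta['num_seqs']} sequences"

chan = FakeChannel()
driver_step(chan, {"phase": "verification", "num_seqs": 4})
msg = worker_step(chan)
```

Paying one small broadcast per invocation keeps all ranks in lockstep without requiring each worker to independently rerun the scheduling logic.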

With Mixtral now rivaling some of the proprietary models, it would be nice to have a comparison of OSS Mixtral against ChatGPT, Claude, and Gemini.