Cade Daniel

Results: 121 comments of Cade Daniel

Do you see any stats from the engine? You should see something like:

```
Speculative metrics: Draft acceptance rate: 0.607, System efficiency: 0.510, Number of speculative tokens: 4, Number of...
```

The acceptance rate stats will print every 5s; try this:

```python3
#!/usr/bin/env python3
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, ...
```
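For reference, a complete version of that script might look like the sketch below; the target/draft model choices, `num_speculative_tokens`, sampling values, and the explicit `disable_log_stats` flag are illustrative assumptions rather than the exact values from the truncated snippet.

```python
#!/usr/bin/env python3
# Hedged sketch of the full script: model names, speculative settings, and
# sampling values here are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(
    model="facebook/opt-6.7b",              # placeholder target model
    speculative_model="facebook/opt-125m",  # placeholder draft model
    num_speculative_tokens=4,
    use_v2_block_manager=True,              # spec decode required the v2 block manager at the time
    disable_log_stats=False,                # keep periodic stats logging on
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

# With stats logging enabled, the engine periodically prints a line like:
#   Speculative metrics: Draft acceptance rate: ..., System efficiency: ...
```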

* Metrics are generated here: https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/model_executor/layers/spec_decode_base_sampler.py#L125-L127
* Copied to CPU periodically here: https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/spec_decode/metrics.py#L82-L96
* Copied to the LLM engine here: https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/spec_decode/spec_decode_worker.py#L741-L745 and https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/engine/llm_engine.py#L1483-L1489
* Printed here: https://github.com/vllm-project/vllm/blob/2ecf7b175703de020943b33532baaf6a31f69d3a/vllm/engine/metrics.py#L432-L441

The metrics are currently...

Can you add a print here to verify that the acceptance rate metrics are being collected? https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/spec_decode/metrics.py#L84-L98 They should be printed here: https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/engine/metrics.py#L386-L392
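A debug print along these lines would confirm it; the method and attribute names below are illustrative stand-ins, not the exact ones in `vllm/spec_decode/metrics.py`:

```python
# Illustrative debug print -- names here are assumptions, not the actual
# collection path in vllm/spec_decode/metrics.py.
metrics = collector.maybe_collect_metrics()  # hypothetical collection call
if metrics is not None:
    print(f"[spec-decode debug] accepted_tokens={metrics.accepted_tokens} "
          f"draft_tokens={metrics.draft_tokens} "
          f"draft_acceptance_rate={metrics.draft_acceptance_rate:.3f}")
```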

Thanks for the fix -- approach looks good to me. On whether or not we should support this -- for performance we would want to disable this feature or support...

Hi @Vermeille. Great work. For an implementation in vLLM, this could be done at a layer similar to Speculative Decoding:

```
LLMEngine
  CFGWorker  < logic which calls the underlying worker twice,...
```
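To make that layering concrete, here is a very rough sketch of what a CFGWorker could look like; the class shape, method names, and guidance formula are illustrative assumptions, not vLLM's actual worker interface:

```python
import torch


class CFGWorker:
    """Illustrative sketch only: wraps an existing model worker and runs it
    twice per step (once with the conditional prompt, once with the
    unconditional/negative prompt), then combines the two sets of logits
    with classifier-free guidance."""

    def __init__(self, worker, guidance_scale: float = 1.5):
        self.worker = worker              # underlying model worker (assumed API)
        self.guidance_scale = guidance_scale

    @torch.inference_mode()
    def execute_model(self, cond_inputs, uncond_inputs) -> torch.Tensor:
        cond_logits = self.worker.execute_model(cond_inputs)
        uncond_logits = self.worker.execute_model(uncond_inputs)
        # CFG: push the distribution away from the unconditional logits and
        # toward the conditional ones.
        return uncond_logits + self.guidance_scale * (cond_logits - uncond_logits)
```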

Thanks for the contribution! Will take a look today or tomorrow.

+1; I suggest we generalize top-1 and top-k proposing/scoring (including defragmentation of accepted KV). Then we can use the top-1 and top-k implementations with different spec proposal methods (draft, medusa,...
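One possible shape for that split, sketched with invented names (the real vLLM abstractions may differ):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import torch


@dataclass
class SpeculativeProposals:
    """Illustrative container: top-1 proposing stores one candidate per
    position ([batch, k]); top-k proposing stores a small candidate tree
    ([batch, k, top_k])."""
    token_ids: torch.Tensor
    probs: torch.Tensor


class Proposer(ABC):
    """Produces candidate tokens -- draft model, Medusa heads, n-gram, etc."""

    @abstractmethod
    def get_proposals(self, batch) -> SpeculativeProposals:
        ...


class Scorer(ABC):
    """Scores proposals with the target model and returns accepted tokens.
    A top-k scorer is also responsible for defragmenting the KV cache so
    that only the accepted path's blocks are kept."""

    @abstractmethod
    def score_and_accept(self, batch, proposals: SpeculativeProposals) -> torch.Tensor:
        ...
```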

If you're interested in combining chunked prefill and spec decode, see https://github.com/vllm-project/vllm/issues/5016. We have a naive dynamic speculation length policy which disables spec decode when the batch size gets too...
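As a rough illustration, that naive policy amounts to something like the function below; the threshold name and value are invented for the example:

```python
def speculation_length(batch_size: int,
                       num_speculative_tokens: int = 4,
                       disable_above_batch_size: int = 32) -> int:
    """Illustrative sketch of a naive dynamic speculation-length policy:
    keep a fixed k for small batches, and fall back to normal decoding
    (k = 0) once the batch is large enough that drafting and verifying
    extra tokens no longer pays off."""
    if batch_size > disable_above_batch_size:
        return 0  # disable speculative decoding for large batches
    return num_speculative_tokens
```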

I will take another pass on Monday (out of office this week).