[feat] Implement Elastic Speculation: Adaptive Draft Length + Confidence-Based Early Exit
Purpose
We implemented Elastic Speculation, an adaptive control layer for EAGLE speculative decoding that delivers double-digit latency improvements over fixed-length speculation while reducing KV-cache DRAM traffic.
Two independent features:
- Adaptive Draft Length: Dynamically adjusts speculation depth based on acceptance rates
- Confidence-Based Early Exit: Gates KV writes for low-confidence draft tokens
Summary
Adaptive Draft Length
Captures multiple draft lengths as separate CUDA graphs (default: [5, 10, 15]), tracks the acceptance rate via an EWMA, and selects the draft length based on acceptance thresholds:
- >70% acceptance → d = high (maximize benefit)
- 50-70% → d = med (balanced)
- 30-50% → d = low (conservative)
- <30% → d = very low (minimize waste)
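A minimal sketch of how this selection could look, assuming an EWMA over per-step acceptance rates and the default option set; the class and method names (`AcceptanceTracker`, `select_draft_length`) are illustrative, not the exact identifiers in vllm/v1/spec_decode/metrics.py:

```python
# Illustrative sketch of EWMA acceptance tracking and threshold-based
# draft-length selection (names are hypothetical).

class AcceptanceTracker:
    def __init__(self, alpha: float = 0.1, initial: float = 0.5):
        self.alpha = alpha          # EWMA smoothing factor
        self.acceptance = initial   # smoothed acceptance rate in [0, 1]

    def update(self, num_accepted: int, num_drafted: int) -> None:
        """Fold the latest step's acceptance rate into the EWMA."""
        if num_drafted == 0:
            return
        rate = num_accepted / num_drafted
        self.acceptance = self.alpha * rate + (1 - self.alpha) * self.acceptance

    def select_draft_length(self, options: list[int]) -> int:
        """Map the smoothed acceptance rate onto one of the captured draft lengths."""
        opts = sorted(options)              # e.g. [5, 10, 15]
        if self.acceptance > 0.7:
            return opts[-1]                 # high acceptance: speculate deep
        if self.acceptance > 0.5:
            return opts[len(opts) // 2]     # balanced
        if self.acceptance > 0.3:
            return opts[0]                  # conservative
        # Very low acceptance: with three options this collapses onto the
        # smallest graph; the actual implementation may back off further.
        return opts[0]


tracker = AcceptanceTracker()
tracker.update(num_accepted=8, num_drafted=10)
next_draft_len = tracker.select_draft_length([5, 10, 15])
```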
Confidence-Based Early Exit
When a draft token's confidence drops below the threshold, we set its slot_mapping entry to -1; the CUDA kernel then early-returns and skips the KV write.
In the reshape_and_cache kernel:
```cpp
if (slot_idx < 0) return;  // No DRAM write
```
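On the host side, the gating reduces to masking the slot indices of low-confidence draft tokens before the cache write. A hedged PyTorch sketch, assuming per-token draft probabilities are available; the function and tensor names are illustrative, not the exact ones in vllm/v1/spec_decode/eagle.py:

```python
import torch

def gate_low_confidence_slots(
    slot_mapping: torch.Tensor,       # [num_draft_tokens] KV-cache slot indices
    draft_token_probs: torch.Tensor,  # [num_draft_tokens] confidence of each draft token
    threshold: float,
) -> torch.Tensor:
    """Mark low-confidence draft tokens with slot -1 so reshape_and_cache skips them."""
    if threshold <= 0.0:
        return slot_mapping           # 0.0 = feature disabled
    return torch.where(
        draft_token_probs < threshold,
        torch.full_like(slot_mapping, -1),  # kernel early-returns on slot_idx < 0
        slot_mapping,
    )
```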
Implementation
Core implementation:
- vllm/config/speculative.py - Config options
- vllm/v1/cudagraph_dispatcher.py - Multi-draft-length graph support
- vllm/v1/spec_decode/eagle.py - Adaptive + early exit logic
- vllm/v1/spec_decode/metrics.py - EWMA tracking, draft selection
Environment variables:
- VLLM_SPEC_ADAPTIVE_DRAFT_LENGTH: 0 or 1 (disable/enable adaptive draft length)
- VLLM_SPEC_CONFIDENCE_THRESHOLD: float in [0, 1]
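For example, both features could be toggled per-process before the engine is constructed (a sketch; the values shown are illustrative):

```python
import os

# Illustrative: enable adaptive draft length and gate KV writes for draft tokens
# whose confidence falls below 0.5. Set these before the engine starts.
os.environ["VLLM_SPEC_ADAPTIVE_DRAFT_LENGTH"] = "1"
os.environ["VLLM_SPEC_CONFIDENCE_THRESHOLD"] = "0.5"
```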
Config options:
```python
SpeculativeConfig(
    draft_length_options=[d_1 ... d_n],  # None = auto-compute
    draft_confidence_threshold=c,        # 0.0 = disabled
)
```
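For offline inference, the new fields can also be passed through speculative_config, assuming the dict keys map one-to-one onto the SpeculativeConfig fields added in this PR. A hedged sketch; the target model and EAGLE draft checkpoint are placeholders:

```python
from vllm import LLM, SamplingParams

# Sketch only: model names are placeholders, and the new keys below come from
# this PR and may change during review.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 15,
        "draft_length_options": [5, 10, 15],   # None = auto-compute
        "draft_confidence_threshold": 0.5,     # 0.0 = disabled
    },
)
outputs = llm.generate(
    ["Explain elastic speculation in one sentence."],
    SamplingParams(max_tokens=64),
)
```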
Testing and Results
The implementation was tested across draft lengths (5, 10, 15) and early-exit thresholds (0.3, 0.5, 0.7) on two Llama target + draft model pairs, and evaluated on four benchmark datasets chosen to simulate diverse inference workloads (Alpaca, SQuAD, BigCodeBench, and CNN/DailyMail).
We observe a ~20-50% reduction in latency depending on the task and configuration, and confirm a threshold-proportional reduction in KV writes (~50% fewer writes at a threshold of 0.5, at a 1-3% latency cost). In production, where as much as 70% of memory can be KV cache and up to 50% of attention cycles stall on bandwidth constraints, this reduction in DRAM traffic could translate into additional latency savings as well.
Our blog post with background, methodology, and full results can be found here: https://iluvatarlabs.com/blog/2025/11/elastic-speculation/
We're excited to PR this to vLLM and happy to address any feedback!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @yuz207.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
CC @benchislett @luccafong
This looks very interesting!
@MatthewBonanni @LucasWilkinson