
[feat] Implement Elastic Speculation: Adaptive Draft Length + Confidence-Based Early Exit

Open • yuz207 opened this pull request 3 weeks ago

Purpose

We implemented Elastic Speculation, an adaptive control layer for EAGLE speculative decoding that delivers double-digit latency improvements over fixed-length speculation while reducing KV-cache DRAM traffic.

Two independent features:

  1. Adaptive Draft Length: Dynamically adjusts speculation depth based on acceptance rates
  2. Confidence-Based Early Exit: Gates KV writes for low-confidence draft tokens

Summary

Adaptive Draft Length

Captures multiple draft lengths as CUDA graphs (default: [5, 10, 15]), tracks the acceptance rate via an EWMA, and selects a depth from the thresholds below (sketched in code after the list):

  • ≥70% acceptance → d = high (maximize benefit)
  • 50-70% → d = med (balanced)
  • 30-50% → d = low (conservative)
  • <30% → d = very low (minimize waste)
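
A minimal sketch of this policy (class and parameter names such as AdaptiveDraftSelector and alpha are illustrative, not the PR's actual API):

class AdaptiveDraftSelector:
    def __init__(self, draft_lengths=(5, 10, 15), alpha=0.1):
        self.draft_lengths = sorted(draft_lengths)  # captured CUDA-graph lengths
        self.alpha = alpha                          # EWMA smoothing factor
        self.acceptance_ewma = 1.0                  # optimistic prior

    def update(self, num_accepted: int, num_drafted: int) -> None:
        # Exponentially weighted moving average of per-step acceptance.
        rate = num_accepted / max(num_drafted, 1)
        self.acceptance_ewma = self.alpha * rate + (1 - self.alpha) * self.acceptance_ewma

    def select(self) -> int:
        # Map the smoothed rate onto the captured lengths; with three
        # captured graphs, the "low" and "very low" tiers both fall back
        # to the shortest one.
        low, med, high = self.draft_lengths
        if self.acceptance_ewma >= 0.7:
            return high
        if self.acceptance_ewma >= 0.5:
            return med
        return low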

Confidence-Based Early Exit

When a draft token's confidence drops below the threshold, its slot_mapping entry is set to -1, and the CUDA kernel early-returns, skipping the KV write:

In reshape_and_cache kernel:

if (slot_idx < 0) return;  // No DRAM write
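
On the host side, the gating reduces to masking slot_mapping before the cache write. A minimal sketch, assuming per-token draft probabilities are available (tensor and function names are illustrative):

import torch

def gate_kv_writes(
    slot_mapping: torch.Tensor,       # [num_draft_tokens], int64 cache slots
    draft_token_probs: torch.Tensor,  # [num_draft_tokens], token confidences
    threshold: float,
) -> torch.Tensor:
    if threshold <= 0.0:
        return slot_mapping  # 0.0 = feature disabled
    # Slots for low-confidence tokens become -1, so the kernel skips them.
    return slot_mapping.masked_fill(draft_token_probs < threshold, -1)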

Implementation

Core implementation:

  • vllm/config/speculative.py - Config options
  • vllm/v1/cudagraph_dispatcher.py - Multi-draft-length graph support
  • vllm/v1/spec_decode/eagle.py - Adaptive + early exit logic
  • vllm/v1/spec_decode/metrics.py - EWMA tracking, draft selection

Environment variables:

VLLM_SPEC_ADAPTIVE_DRAFT_LENGTH = 0 | 1    # 0 = off, 1 = on
VLLM_SPEC_CONFIDENCE_THRESHOLD = [0, 1]    # float in [0, 1]; 0.0 disables early exit
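
For example, from Python before vLLM is initialized (values here are illustrative; the same flags can be exported from the shell):

import os

os.environ["VLLM_SPEC_ADAPTIVE_DRAFT_LENGTH"] = "1"   # enable adaptive draft length
os.environ["VLLM_SPEC_CONFIDENCE_THRESHOLD"] = "0.5"  # gate KV writes below p = 0.5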

Config options:

SpeculativeConfig(
  draft_length_options=[5, 10, 15],  # None = auto-compute
  draft_confidence_threshold=0.5,    # 0.0 = disabled
)
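
End to end, usage could look like the following sketch, assuming the new fields are threaded through vLLM's speculative_config dict alongside the existing EAGLE options (model names are placeholders):

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",           # placeholder target
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # placeholder drafter
        "num_speculative_tokens": 5,
        # New fields proposed in this PR:
        "draft_length_options": [5, 10, 15],   # None = auto-compute
        "draft_confidence_threshold": 0.5,     # 0.0 = disabled
    },
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)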

Testing and Results

The implementation was tested across draft lengths (5, 10, 15), early-exit thresholds (0.3, 0.5, 0.7), and two Llama target/draft model pairs, and evaluated on four benchmark datasets chosen to simulate diverse inference workloads (Alpaca, SQuAD, BigCodeBench, and CNN/DailyMail).

We observe a ~20-50% latency reduction depending on task and configuration, and confirm a threshold-proportional reduction in KV writes (~50% fewer writes at a 0.5 threshold, for a 1-3% latency cost). In production, where KV cache can account for as much as 70% of memory and up to 50% of attention cycles can stall on bandwidth, the reduced DRAM traffic could translate into further latency savings.

Our blog post with background, methodology, and full results can be found here: https://iluvatarlabs.com/blog/2025/11/elastic-speculation/

We're excited to PR this to vLLM and happy to address any feedback!

yuz207 • Nov 14 '25 00:11

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @yuz207.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] • Nov 14 '25 00:11

CC @benchislett @luccafong

heheda12345 • Nov 16 '25 07:11

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @yuz207.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] • Nov 20 '25 11:11

This looks very interesting!

robertgshaw2-redhat • Nov 24 '25 23:11

@MatthewBonanni @LucasWilkinson

robertgshaw2-redhat • Nov 24 '25 23:11

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @yuz207.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] • Nov 28 '25 06:11