exo
Add speculative decoding support with draft models
Summary
This PR adds support for speculative decoding using draft models to accelerate inference. Draft models are smaller, faster models that generate candidate tokens which are then verified by the main model in parallel.
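To make the mechanism concrete, here is a toy sketch of one speculative decoding step with greedy acceptance: the draft model proposes tokens one at a time, the main model scores every position in a single pass, and the longest agreeing prefix is accepted plus one correction token. This is illustrative only and does not mirror exo's actual runner code; the model callables and their signatures are assumptions.

```python
from typing import Callable, List

def speculative_step(
    main_model: Callable[[List[int]], List[int]],  # greedy next token for every prefix of the input
    draft_model: Callable[[List[int]], int],       # greedy next token for the full input
    context: List[int],
    num_draft_tokens: int,
) -> List[int]:
    """One speculative step: draft proposes, main model verifies in parallel."""
    # 1. Draft model autoregressively proposes candidates (cheap, sequential).
    draft, ctx = [], list(context)
    for _ in range(num_draft_tokens):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Main model scores all positions in ONE forward pass (expensive, but parallel).
    #    verified[i] is the main model's greedy token after prefix tokens[:i+1].
    verified = main_model(context + draft)

    # 3. Accept the longest prefix where main and draft agree, then append the
    #    main model's own token at the first disagreement (or as a bonus token).
    accepted: List[int] = []
    for i, t in enumerate(draft):
        if verified[len(context) + i - 1] == t:
            accepted.append(t)
        else:
            break
    accepted.append(verified[len(context) + len(accepted) - 1])
    return accepted
```

Because at least one token is always accepted per main-model pass, throughput never drops below plain decoding, and it improves whenever the draft model guesses well.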
Key Features
- Draft model configuration: Set `draft_model` and `num_draft_tokens` when creating an instance
- Runtime updates: Change or clear the draft model on running instances via API or dashboard
- Parallel downloads: Draft model downloads in parallel with the main model for faster startup
- Dashboard UI: Draft model icon on instance cards to configure speculative decoding
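For launch-time configuration, a creation payload might look like the following. Only the `draft_model` and `num_draft_tokens` field names come from this PR; the model names and other fields are hypothetical placeholders.

```python
import json

# Hypothetical instance-creation payload. Field names `draft_model` and
# `num_draft_tokens` are from this PR; everything else is illustrative.
payload = {
    "model": "llama-3.1-70b",       # main model (example name)
    "draft_model": "llama-3.2-1b",  # smaller, faster draft model (example name)
    "num_draft_tokens": 4,          # candidate tokens proposed per step
}
body = json.dumps(payload)
```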
Changes
Types & Events
- `Instance`: Added `draft_model` and `num_draft_tokens` fields
- `SetDraftModel` task: Load/clear draft models on running instances
- `InstanceDraftModelUpdated` event: Propagate draft model changes to state
- `SetInstanceDraftModel` command: API command for runtime updates
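A minimal sketch of how these three types could be shaped, assuming plain dataclasses with `None` meaning "clear the draft model". The type names and the listed fields come from this PR; base classes, extra fields, and the default token count are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SetDraftModel:
    """Task sent to a runner: load a new draft model, or clear it with None."""
    draft_model: Optional[str]
    num_draft_tokens: int = 4  # default value is an assumption

@dataclass
class InstanceDraftModelUpdated:
    """Event propagating a draft model change into shared state."""
    instance_id: str
    draft_model: Optional[str]
    num_draft_tokens: int

@dataclass
class SetInstanceDraftModel:
    """API command targeting a running instance."""
    instance_id: str
    draft_model: Optional[str]
    num_draft_tokens: int
```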
Worker
- `plan.py`: Download draft models at `RunnerIdle` (parallel with the main model)
- `runner.py`: Load the draft model during the `LoadModel` phase; handle the `SetDraftModel` task
- `main.py`: Handle the `SetDraftModel` task and update the bound instance
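The parallel-download idea from `plan.py` can be sketched with `asyncio.gather`: when both models are needed, start both fetches concurrently so startup is bounded by the slower download rather than their sum. The function names and paths here are illustrative, not exo's real API.

```python
import asyncio
from typing import Optional, Tuple

async def download(model: str) -> str:
    """Stand-in for the real shard-download I/O; returns a local path."""
    await asyncio.sleep(0)
    return f"/models/{model}"

async def prepare_models(main_model: str, draft_model: Optional[str]):
    if draft_model is None:
        return (await download(main_model), None)
    # gather() runs both downloads concurrently: total wait is
    # max(main, draft) instead of main + draft.
    return await asyncio.gather(download(main_model), download(draft_model))

paths = asyncio.run(prepare_models("main-70b", "draft-1b"))
```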
API & Master
- `api.py`: `PUT /instance/{id}/draft_model` endpoint
- `main.py`: Handle the `SetInstanceDraftModel` command
- `placement.py`: Pass draft model config when creating instances
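A client call to the new endpoint might look like this. The path and the two field names come from this PR; the host, port, instance id, and the assumption that a `null` draft model clears it (mirroring the "clear" item in the test plan) are placeholders.

```python
import json

instance_id = "abc123"  # placeholder instance id
url = f"http://localhost:52415/instance/{instance_id}/draft_model"

# Set or change the draft model on a running instance:
set_body = json.dumps({"draft_model": "llama-3.2-1b", "num_draft_tokens": 4})

# Clear it (assumption: a null draft_model clears speculative decoding):
clear_body = json.dumps({"draft_model": None})
```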
Dashboard
- Draft model icon button on instance cards (both main view and chat view)
- Modal to select draft model and configure token count
- Visual indicator when draft model is active (cyan highlight)
Test plan
- [ ] Create instance with draft model configured at launch
- [ ] Create instance without draft model, add via dashboard
- [ ] Clear draft model from running instance
- [ ] Verify the draft model shows in both instance panels
- [ ] Verify parallel download of draft and main models
- [ ] Run inference with speculative decoding enabled
🤖 Generated with Claude Code