exo
Add speculative decoding support with draft models
Summary
This PR adds support for speculative decoding using draft models to accelerate inference. Draft models are smaller, faster models that generate candidate tokens which are then verified by the main model in parallel.
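To make the mechanism concrete, here is a toy sketch of one speculative decoding step with greedy acceptance: the draft model proposes tokens one at a time, the main model scores every position in a single pass, and the longest agreeing prefix is accepted plus one correction token. This is illustrative only and does not mirror exo's actual runner code; the model callables and their signatures are assumptions.

```python
from typing import Callable, List

def speculative_step(
    main_model: Callable[[List[int]], List[int]],  # greedy next token for every prefix of the input
    draft_model: Callable[[List[int]], int],       # greedy next token for the full input
    context: List[int],
    num_draft_tokens: int,
) -> List[int]:
    """One speculative step: draft proposes, main model verifies in parallel."""
    # 1. Draft model autoregressively proposes candidates (cheap, sequential).
    draft, ctx = [], list(context)
    for _ in range(num_draft_tokens):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Main model scores all positions in ONE forward pass (expensive, but parallel).
    #    verified[i] is the main model's greedy token after prefix tokens[:i+1].
    verified = main_model(context + draft)

    # 3. Accept the longest prefix where main and draft agree, then append the
    #    main model's own token at the first disagreement (or as a bonus token).
    accepted: List[int] = []
    for i, t in enumerate(draft):
        if verified[len(context) + i - 1] == t:
            accepted.append(t)
        else:
            break
    accepted.append(verified[len(context) + len(accepted) - 1])
    return accepted
```

Because at least one token is always accepted per main-model pass, throughput never drops below plain decoding, and it improves whenever the draft model guesses well.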
Key Features
- Draft model configuration: Set `draft_model` and `num_draft_tokens` when creating an instance
- Runtime updates: Change or clear the draft model on running instances via API or dashboard
- Parallel downloads: Draft model downloads in parallel with the main model for faster startup
- Dashboard UI: Draft model icon on instance cards to configure speculative decoding
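For launch-time configuration, a creation payload might look like the following. Only the `draft_model` and `num_draft_tokens` field names come from this PR; the model names and other fields are hypothetical placeholders.

```python
import json

# Hypothetical instance-creation payload. Field names `draft_model` and
# `num_draft_tokens` are from this PR; everything else is illustrative.
payload = {
    "model": "llama-3.1-70b",       # main model (example name)
    "draft_model": "llama-3.2-1b",  # smaller, faster draft model (example name)
    "num_draft_tokens": 4,          # candidate tokens proposed per step
}
body = json.dumps(payload)
```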
Changes
Types & Events
- `Instance`: Added `draft_model` and `num_draft_tokens` fields
- `SetDraftModel` task: Load/clear draft models on running instances
- `InstanceDraftModelUpdated` event: Propagate draft model changes to state
- `SetInstanceDraftModel` command: API command for runtime updates
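A minimal sketch of how these three types could be shaped, assuming plain dataclasses with `None` meaning "clear the draft model". The type names and the listed fields come from this PR; base classes, extra fields, and the default token count are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SetDraftModel:
    """Task sent to a runner: load a new draft model, or clear it with None."""
    draft_model: Optional[str]
    num_draft_tokens: int = 4  # default value is an assumption

@dataclass
class InstanceDraftModelUpdated:
    """Event propagating a draft model change into shared state."""
    instance_id: str
    draft_model: Optional[str]
    num_draft_tokens: int

@dataclass
class SetInstanceDraftModel:
    """API command targeting a running instance."""
    instance_id: str
    draft_model: Optional[str]
    num_draft_tokens: int
```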
Worker
- `plan.py`: Download draft models at `RunnerIdle` (parallel with the main model)
- `runner.py`: Load the draft model during the `LoadModel` phase; handle the `SetDraftModel` task
- `main.py`: Handle the `SetDraftModel` task and update the bound instance
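The parallel-download idea from `plan.py` can be sketched with `asyncio.gather`: when both models are needed, start both fetches concurrently so startup is bounded by the slower download rather than their sum. The function names and paths here are illustrative, not exo's real API.

```python
import asyncio
from typing import Optional, Tuple

async def download(model: str) -> str:
    """Stand-in for the real shard-download I/O; returns a local path."""
    await asyncio.sleep(0)
    return f"/models/{model}"

async def prepare_models(main_model: str, draft_model: Optional[str]):
    if draft_model is None:
        return (await download(main_model), None)
    # gather() runs both downloads concurrently: total wait is
    # max(main, draft) instead of main + draft.
    return await asyncio.gather(download(main_model), download(draft_model))

paths = asyncio.run(prepare_models("main-70b", "draft-1b"))
```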
API & Master
- `api.py`: `PUT /instance/{id}/draft_model` endpoint
- `main.py`: Handle the `SetInstanceDraftModel` command
- `placement.py`: Pass draft model config when creating instances
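A client call to the new endpoint might look like this. The path and the two field names come from this PR; the host, port, instance id, and the assumption that a `null` draft model clears it (mirroring the "clear" item in the test plan) are placeholders.

```python
import json

instance_id = "abc123"  # placeholder instance id
url = f"http://localhost:52415/instance/{instance_id}/draft_model"

# Set or change the draft model on a running instance:
set_body = json.dumps({"draft_model": "llama-3.2-1b", "num_draft_tokens": 4})

# Clear it (assumption: a null draft_model clears speculative decoding):
clear_body = json.dumps({"draft_model": None})
```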
Dashboard
- Draft model icon button on instance cards (both main view and chat view)
- Modal to select draft model and configure token count
- Visual indicator when draft model is active (cyan highlight)
Test plan
- [ ] Create instance with draft model configured at launch
- [ ] Create instance without draft model, add via dashboard
- [ ] Clear draft model from running instance
- [ ] Verify the draft model shows in both instance panels
- [ ] Verify parallel download of draft and main models
- [ ] Run inference with speculative decoding enabled
🤖 Generated with Claude Code