Add speculative decoding support with draft models

Open • AlexCheema opened this issue 1 month ago • 1 comment

Summary

This PR adds support for speculative decoding using draft models to accelerate inference. Draft models are smaller, faster models that generate candidate tokens which are then verified by the main model in parallel.
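
To make the draft-then-verify idea concrete, here is a minimal sketch of one greedy speculative decoding round. It is not the code from this PR: the "models" are toy functions, the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, and verification runs position by position, whereas a real implementation would score all draft positions in a single batched forward pass of the main model.

```python
from typing import Callable, List

# Illustrative assumption: a "model" is any function mapping a token
# sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    target: NextToken,
    draft: NextToken,
    tokens: List[int],
    num_draft_tokens: int,
) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes candidates autoregressively.
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(num_draft_tokens):
        candidate = draft(ctx)
        proposed.append(candidate)
        ctx.append(candidate)

    # 2. The target model checks each candidate in order.
    accepted: List[int] = []
    ctx = list(tokens)
    for candidate in proposed:
        expected = target(ctx)
        if expected != candidate:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(candidate)
        ctx.append(candidate)

    # 3. Every candidate matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Toy main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [0], num_draft_tokens=4))
```

In this greedy variant the accepted tokens always match what the main model would have produced on its own, so the draft model affects only speed, not output.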

Key Features

  • Draft model configuration: Set draft_model and num_draft_tokens when creating an instance
  • Runtime updates: Change or clear draft model on running instances via API or dashboard
  • Parallel downloads: The draft model downloads in parallel with the main model for faster startup
  • Dashboard UI: Draft model icon on instance cards to configure speculative decoding

Changes

Types & Events

  • Instance: Added draft_model and num_draft_tokens fields
  • SetDraftModel task: Load/clear draft models on running instances
  • InstanceDraftModelUpdated event: Propagate draft model changes to state
  • SetInstanceDraftModel command: API command for runtime updates
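
As a rough illustration of how these pieces could fit together, the sketch below models an instance with the two new fields and applies a draft model update event to a state mapping. Only the names `Instance`, `InstanceDraftModelUpdated`, `draft_model`, and `num_draft_tokens` come from this PR; the other fields, the types, and the `apply_draft_model_updated` helper are assumptions made for the example and may not match the actual definitions.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Instance:
    instance_id: str                        # assumed field
    model: str                              # assumed field
    draft_model: Optional[str] = None       # None means no speculative decoding
    num_draft_tokens: Optional[int] = None  # assumed optional


@dataclass(frozen=True)
class InstanceDraftModelUpdated:
    instance_id: str
    draft_model: Optional[str]              # None clears the draft model
    num_draft_tokens: Optional[int]


def apply_draft_model_updated(
    state: Dict[str, Instance],
    event: InstanceDraftModelUpdated,
) -> Dict[str, Instance]:
    """Return a new state mapping with the event applied (hypothetical helper)."""
    instance = state.get(event.instance_id)
    if instance is None:
        return state  # Unknown instance: nothing to update.
    updated = replace(
        instance,
        draft_model=event.draft_model,
        num_draft_tokens=event.num_draft_tokens,
    )
    return {**state, event.instance_id: updated}
```

Returning a new mapping instead of mutating in place keeps the update easy to replay and test; whether the real state handling works this way is not stated here.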

Worker

  • plan.py: Download draft models at RunnerIdle (parallel with main model)
  • runner.py: Load draft model during LoadModel phase, handle SetDraftModel task
  • main.py: Handle SetDraftModel task and update bound instance
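
The parallel download step can be pictured with a small `asyncio` sketch. This is not the logic in plan.py: `download_model` is a stand-in that only sleeps, and `download_for_instance` is a hypothetical helper showing how a main model and an optional draft model could be fetched concurrently.

```python
import asyncio
from typing import List, Optional


async def download_model(model_id: str, log: List[str], delay: float = 0.01) -> str:
    """Stand-in for a real download: records start and finish, then returns the id."""
    log.append(f"start {model_id}")
    await asyncio.sleep(delay)  # Simulates network I/O.
    log.append(f"done {model_id}")
    return model_id


async def download_for_instance(
    model: str,
    draft_model: Optional[str],
    log: List[str],
) -> List[str]:
    """Download the main model and, if configured, the draft model concurrently."""
    model_ids = [model] + ([draft_model] if draft_model else [])
    # gather schedules every download before awaiting any of them,
    # so the draft download does not wait for the main one to finish.
    results = await asyncio.gather(*(download_model(m, log) for m in model_ids))
    return list(results)


if __name__ == "__main__":
    events: List[str] = []
    print(asyncio.run(download_for_instance("main-model-id", "small-draft-model-id", events)))
    print(events)
```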

API & Master

  • api.py: PUT /instance/{id}/draft_model endpoint
  • main.py: Handle SetInstanceDraftModel command
  • placement.py: Pass draft model config when creating instances
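
For a sense of what a client call to the new endpoint might look like, the sketch below builds (but does not send) a PUT request using only the standard library. The path comes from this PR; the JSON body shape, the base URL, and the `build_set_draft_model_request` helper are assumptions, so check api.py for the actual request schema.

```python
import json
import urllib.request
from typing import Optional


def build_set_draft_model_request(
    base_url: str,
    instance_id: str,
    draft_model: Optional[str],
    num_draft_tokens: Optional[int],
) -> urllib.request.Request:
    """Build a PUT /instance/{id}/draft_model request (body shape is assumed)."""
    body = json.dumps(
        {
            # Assumed payload keys, mirroring the Instance field names.
            "draft_model": draft_model,  # None (JSON null) to clear
            "num_draft_tokens": num_draft_tokens,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/instance/{instance_id}/draft_model",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host, port, and ids; sending is left to the caller,
    # for example with urllib.request.urlopen(request).
    request = build_set_draft_model_request(
        "http://localhost:8000", "my-instance-id", "small-draft-model-id", 4
    )
    print(request.get_method(), request.full_url)
```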

Dashboard

  • Draft model icon button on instance cards (both main view and chat view)
  • Modal to select draft model and configure token count
  • Visual indicator when draft model is active (cyan highlight)

Test plan

  • [ ] Create instance with draft model configured at launch
  • [ ] Create instance without draft model, add via dashboard
  • [ ] Clear draft model from running instance
  • [ ] Verify draft model shows in both instance panels
  • [ ] Verify parallel download of draft and main models
  • [ ] Run inference with speculative decoding enabled

🤖 Generated with Claude Code

AlexCheema, Jan 18 '26 02:01