
Feature Request: dynamic speculation (i.e. dynamic draft-max)

Open fredlas opened this issue 5 days ago • 0 comments

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Adjust draft-max on the fly during generation to optimize speculative generation performance.

Motivation

Speculative generation works best on structured, highly predictable text like code, where large chunks can be drafted correctly: the larger draft-max is, the more speedup is possible. But a large draft-max wastes time in less structured text, where drafts get rejected early. So some sort of dynamic adjustment of draft-max seems ideal.

Possible Implementation

It looks like this has been tried before with some success. That heuristic seems like a decent starting point: +2 to draft-max on a fully successful speculation, -1 otherwise, presumably with draft-min as a floor. I imagine you could wring a little more performance out of something fancier, maybe taking inspiration from congestion control algorithms (they're roughly the same shape: empirically probing for an unknown window size, with the feedback being success/failure at the current size).
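As a minimal sketch of that heuristic (the function name and signature here are illustrative, not llama.cpp API): grow the window by 2 when every drafted token was accepted, shrink it by 1 otherwise, and clamp to the configured bounds.

```cpp
#include <algorithm>

// Hypothetical helper: adjust the draft window after one speculation round.
// A round counts as "fully successful" when every drafted token was accepted.
int adjust_n_draft(int n_draft, int n_drafted, int n_accepted,
                   int draft_min, int draft_max_cap) {
    if (n_accepted >= n_drafted) {
        n_draft += 2;   // fully successful speculation: grow the window
    } else {
        n_draft -= 1;   // partial acceptance: shrink gently
    }
    // keep the window within [draft_min, draft_max_cap]
    return std::clamp(n_draft, draft_min, draft_max_cap);
}
```

The asymmetry (additive increase, small decrease) is the same shape as AIMD-style congestion control: probe upward cheaply while the text is predictable, back off when acceptance drops.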

This looks pretty easy to implement: callers of common_speculative_gen_draft() just need to adjust common_speculative_params.n_draft, and the feedback signal is the number of tokens accepted by common_sampler_sample_and_accept_n(). I'm going to implement this at least for myself, starting in the server; it should just need a new field on the server_slot struct. If there's interest in this actually getting merged, I'll also implement it for the CLI and wherever else it applies, modify the command line arguments to cleanly support choosing this vs. a static draft-max, and do some evaluations.
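To show where the adjustment would slot in, here is a simplified simulation of that feedback loop. The struct and functions below are stand-ins that mimic the shape of the calls named above (common_speculative_gen_draft / common_sampler_sample_and_accept_n), not the real llama.cpp signatures; the only per-slot state the scheme adds is the current n_draft value.

```cpp
#include <algorithm>

// Stand-in for common_speculative_params: only the field we adjust.
struct spec_params { int n_draft; };

// Pretend draft generator: always drafts exactly n_draft tokens.
static int gen_draft(const spec_params & p) { return p.n_draft; }

// Pretend verifier: the target model accepts up to `correct` tokens.
static int accept_n(int n_drafted, int correct) {
    return std::min(n_drafted, correct);
}

// One generation step with dynamic n_draft adjustment (+2 / -1, clamped).
int step(spec_params & p, int correct, int draft_min, int draft_max_cap) {
    const int n_drafted  = gen_draft(p);
    const int n_accepted = accept_n(n_drafted, correct);
    p.n_draft += (n_accepted == n_drafted) ? 2 : -1;
    p.n_draft  = std::clamp(p.n_draft, draft_min, draft_max_cap);
    return n_accepted;
}
```

In the server, `p.n_draft` would persist on the slot across requests' decode steps, so the window size learned in a long structured stretch carries forward instead of resetting every round.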

fredlas avatar Feb 17 '25 20:02 fredlas