PainlessInferenceAcceleration
In the benchmark studies, how are the draft tokens generated?
I read with great interest your paper 'Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy'.
In essence, the paper proposes a tree data structure for verifying candidate draft tokens, thereby speeding up inference.
Unfortunately, it is not clear to me from the paper how these draft tokens were generated when establishing the benchmark results for LookAhead-Parallel and LookAhead-Hierarchical.
I understand that the focus of the paper is on how to handle a set of draft tokens (perhaps as a single branch, perhaps in parallel, or perhaps in a hierarchical manner). But the origin of the draft tokens in the benchmark results remains unclear to me.
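To make concrete what I mean by the verification step (as opposed to draft generation), here is a toy sketch of verifying a single branch of draft tokens; this is my own illustration, not the paper's code, and the `next_token` function is a stand-in for the LLM's greedy decoder:

```python
# Illustrative sketch only (not the paper's implementation): accept the
# longest prefix of a draft branch that matches what the model itself
# would generate, so the output is identical to plain decoding.
# A real implementation would score all draft positions in one forward
# pass; here a toy deterministic `next_token` stands in for the LLM.

def next_token(context):
    # Toy "model": next token is the sum of the context modulo 7.
    return sum(context) % 7

def verify_branch(context, draft_tokens):
    """Accept the longest matching prefix of draft_tokens, then append
    one token from the model itself (which is always valid), so every
    verification step produces at least one token."""
    accepted = []
    for draft in draft_tokens:
        predicted = next_token(context + accepted)
        if predicted != draft:
            break
        accepted.append(draft)
    accepted.append(next_token(context + accepted))
    return accepted

if __name__ == "__main__":
    ctx = [1, 2, 3]
    # First two draft tokens happen to match the toy model; the third does not.
    print(verify_branch(ctx, [6, 5, 0]))  # -> [6, 5, 3]
```

My question is about the box feeding `draft_tokens` into such a procedure: in the benchmarks, where do those candidates come from?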