[bug] Lookahead spec-dec verifies w guesses instead of g
System Info
All systems
Who can help?
@kaiyux
Information
- [x] The official example scripts
- [x] My own modified scripts
Tasks
- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
.
Expected behavior
The pool manager makes g guesses
Actual behavior
The pool manager is making w guesses
Additional notes
The bug is on this line: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadAlgorithm.cpp#L183 It passes the parameter w where it should pass the parameter g. As a result, whenever w is smaller than g, the algorithm verifies fewer n-grams than the configuration asks for.
Why has this not caused any errors so far? The LookaheadPoolManager already clips the number of guesses to g here: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadPoolManager.cpp#L53 As a result, the number of n-gram guesses (with the bug) is min(w, g), so there are never more than g guesses.
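To make the min(w, g) argument concrete, here is a minimal sketch of the counting logic only. It is not the TensorRT-LLM code: the function names `clippedGuessCount`, `guessCountWithBug`, and `guessCountWithFix` are hypothetical, and it assumes the pool always holds at least g candidate n-grams.

```cpp
#include <algorithm>

// Hypothetical stand-in for the clipping done by LookaheadPoolManager:
// the pool never hands back more than g guesses, whatever was requested.
// Assumption: the pool holds at least g candidate n-grams.
constexpr int clippedGuessCount(int requested, int g)
{
    return std::min(requested, g);
}

// Behavior described in this issue: the caller requests w guesses.
constexpr int guessCountWithBug(int w, int g)
{
    return clippedGuessCount(w, g);
}

// Proposed behavior: the caller requests g guesses (w is unused here).
constexpr int guessCountWithFix(int /*w*/, int g)
{
    return clippedGuessCount(g, g);
}

// w < g: the bug silently drops guesses, the fix restores all g of them.
static_assert(guessCountWithBug(3, 7) == 3, "only w guesses are verified");
static_assert(guessCountWithFix(3, 7) == 7, "all g guesses are verified");

// w >= g: both variants agree, which is why the bug goes unnoticed.
static_assert(guessCountWithBug(7, 3) == guessCountWithFix(7, 3), "no difference");
```

Under these assumptions, the two variants differ only for configurations with w < g, which matches the observation that nothing crashes today.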
We (at Baseten) tried to write our own custom LookaheadPoolManager and ran into an assertion failure at this line: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadAlgorithm.cpp#L188
This bug doesn't cause any crashes with the built-in pool manager, but fixing it may improve performance for users who set g higher than w.
@lfr-0531, do you think you can have a look?