[bug] Lookahead spec-dec verifies w guesses instead of g

Open mahmoudhas opened this issue 8 months ago • 1 comments

System Info

All systems

Who can help?

@kaiyux

Information

[x] The official example scripts
[x] My own modified scripts

Tasks

[x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Expected behavior

The pool manager makes g guesses

actual behavior

The pool manager is making w guesses

additional notes

The bug is in this line: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadAlgorithm.cpp#L183 It passes the parameter w, when it should be passing the parameter g. As a result, the algorithm verifies fewer n-grams than what is expected from the configuration.

Why did this not cause any errors so far? The LookaheadPoolManager already clips the number of guess to g here: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadPoolManager.cpp#L53 As a result, the number of n-gram guesses (with the bug) is min(w, g), so there are no more than g guesses.

We (at Baseten) tried to write our own custom LookaheadPoolManager and ran into an assertion failure at this line: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadAlgorithm.cpp#L188

This bug doesn't cause any crashes, but fixing it may improve the performance for users who set g values higher than w.

Apr 30 '25 23:04 mahmoudhas

@lfr-0531, do you think you can have a look?

May 16 '25 23:05 brb-nv