TensorRT-LLM icon indicating copy to clipboard operation
TensorRT-LLM copied to clipboard

[bug] Lookahead spec-dec verifies w guesses instead of g

Open mahmoudhas opened this issue 8 months ago • 1 comments

System Info

All systems

Who can help?

@kaiyux

Information

  • [x] The official example scripts
  • [x] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

.

Expected behavior

The pool manager makes g guesses

actual behavior

The pool manager is making w guesses

additional notes

The bug is in this line: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadAlgorithm.cpp#L183 It passes the parameter w, when it should be passing the parameter g. As a result, the algorithm verifies fewer n-grams than what is expected from the configuration.

Why did this not cause any errors so far? The LookaheadPoolManager already clips the number of guess to g here: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadPoolManager.cpp#L53 As a result, the number of n-gram guesses (with the bug) is min(w, g), so there are no more than g guesses.

We (at Baseten) tried to write our own custom LookaheadPoolManager and ran into an assertion failure at this line: https://github.com/NVIDIA/TensorRT-LLM/blob/b40f351b7a3f8c15a3e97b7d4e1f9459dafd8922/cpp/tensorrt_llm/layers/lookaheadAlgorithm.cpp#L188

This bug doesn't cause any crashes, but fixing it may improve the performance for users who set g values higher than w.

mahmoudhas avatar Apr 30 '25 23:04 mahmoudhas

@lfr-0531, do you think you can have a look?

brb-nv avatar May 16 '25 23:05 brb-nv