
[Feature] Prompt lookup speculative decoding for LLM API

Open tonyay163 opened this issue 9 months ago • 4 comments

It looks like the model runner API supports prompt lookup speculative decoding: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/prompt_lookup

However, it doesn't seem to be part of the LLM API yet: https://github.com/NVIDIA/TensorRT-LLM/blob/3ee4332fb183bf09a8a8a577bb3dd9a8e68f29f6/tensorrt_llm/llmapi/llm_args.py#L851-L854
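For context, the core of prompt lookup speculative decoding is model-free: match the trailing n-gram of the current sequence against earlier positions in the prompt, and propose the tokens that followed that match as draft tokens for the target model to verify. A minimal sketch of that lookup step (function name and defaults are illustrative, not TensorRT-LLM's implementation):

```python
def find_draft_tokens(tokens, max_ngram=3, num_draft=5):
    """Prompt-lookup draft proposal: match the trailing n-gram of
    `tokens` against an earlier occurrence in the sequence and return
    the tokens that followed it as speculative draft tokens."""
    for n in range(max_ngram, 0, -1):          # prefer longer n-gram matches
        if len(tokens) < n + 1:
            continue
        tail = tokens[-n:]
        # Scan from the most recent candidate backwards, excluding the tail itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                continuation = tokens[start + n : start + n + num_draft]
                if continuation:
                    return continuation
    return []  # no match found; fall back to normal decoding
```

This is why the technique is cheap and works well for tasks like summarization or code editing, where the output repeats long spans of the input.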

tonyay163 avatar Mar 28 '25 06:03 tonyay163

Hi @tonyay163,

Thanks for bringing this to our attention. It is true that prompt lookup speculative decoding is not exposed at the LLM API level yet. We are currently working on making the LLM API stable enough for the official TensorRT-LLM 1.0 release, so for now we may not be able to prioritize exposing prompt lookup speculative decoding in the LLM API.

If you are interested, you are welcome to contribute the code to TensorRT-LLM directly.

@Superjomn for vis on this.

juney-nvidia avatar Mar 28 '25 08:03 juney-nvidia

Thanks for the quick response @juney-nvidia, is there an example PR where the other ones were implemented that I can refer to?

tonyay163 avatar Mar 28 '25 08:03 tonyay163

Hi @tonyay163, I am afraid the major MRs were internal before we switched to GitHub. Recently we have been focusing on the PyTorch path; here is some related code I know of:

cc @lfr-0531 in case there is more information about contributing to the PyTorch speculative decoding part.

Superjomn avatar Mar 28 '25 08:03 Superjomn

@tonyay163

As @Superjomn said, we are now focusing on the PyTorch path to improve the ease of use of TensorRT-LLM (while still ensuring the best performance). Since there is already prompt lookup speculative decoding support in the TensorRT path, you can decide whether you want to implement it in the PyTorch path (by following the MTP example shared by @Superjomn) or expose the current TensorRT-path prompt lookup implementation through the LLM API.

In our design, the details of both the TensorRT and PyTorch paths are hidden behind the LLM API, so as long as you are using the LLM API, switching between TensorRT and PyTorch should be relatively seamless for end users. (There may still be cases where switching from the TensorRT path to the PyTorch path requires some user-side changes, but those changes should be very small.)
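To illustrate the idea of a backend-agnostic knob, here is a hypothetical sketch of what exposing prompt lookup through the LLM API might look like, modeled on the existing speculative-decoding configs in `llm_args.py`. The class `PromptLookupConfig` and its field names are illustrative assumptions, not an actual TensorRT-LLM API:

```python
# Hypothetical sketch -- PromptLookupConfig is NOT an existing
# TensorRT-LLM class; it shows how the feature could be surfaced
# alongside the other speculative-decoding configs in the LLM API.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config=PromptLookupConfig(   # illustrative name/fields
        max_matching_ngram_size=3,           # longest n-gram matched against the prompt
        max_draft_len=5,                     # draft tokens proposed per step
    ),
)
outputs = llm.generate(
    ["Summarize the following article: ..."],
    SamplingParams(max_tokens=128),
)
```

Because the config would live in `LLM(...)` rather than in backend-specific runner code, the same user script could in principle run on either the TensorRT or PyTorch path.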

Please let me know whether this is clear enough for you.

Thanks,
June

juney-nvidia avatar Mar 28 '25 10:03 juney-nvidia

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Sep 22 '25 23:09 github-actions[bot]

This issue was closed because it went 14 days without activity after being marked as stale.

github-actions[bot] avatar Oct 07 '25 03:10 github-actions[bot]