
[Feature Request] Support for Constrained Decoding (such as generating Json formatted output)

Open silverriver opened this issue 1 year ago • 9 comments

### Summary

I would like to propose the addition of constrained decoding support. This feature would allow the output sequence to be constrained by a Finite State Machine (FSM) or Context-Free Grammar (CFG), providing more control over the generated sequences for various applications.

The simplest example is the JSON mode provided by the OpenAI API.

This feature is already implemented in other repos, such as https://github.com/outlines-dev/outlines and https://github.com/guidance-ai/guidance?tab=readme-ov-file#constrained-generation.
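
Conceptually, these libraries work by masking the model's logits at every step so that only grammar-legal tokens can be sampled. A minimal Python sketch of the idea (the `model`, `tokenizer`, and `fsm` interfaces here are hypothetical, purely for illustration; this is not TRT-LLM code):

```python
import math

def constrained_decode(model, tokenizer, fsm, prompt, max_tokens=128):
    """Greedy decoding in which each step may only pick tokens the FSM allows."""
    tokens = tokenizer.encode(prompt)
    state = fsm.initial_state
    for _ in range(max_tokens):
        logits = model.next_token_logits(tokens)  # scores over the vocabulary
        allowed = fsm.allowed_tokens(state)       # token ids legal in this state
        # Mask out every token the grammar forbids.
        masked = [logits[t] if t in allowed else -math.inf
                  for t in range(len(logits))]
        next_token = max(range(len(masked)), key=masked.__getitem__)
        tokens.append(next_token)
        state = fsm.next_state(state, next_token)
        if fsm.is_final(state):
            break
    return tokenizer.decode(tokens)
```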

I am wondering if this feature is on the roadmap of TRT-LLM?

silverriver avatar Feb 20 '24 09:02 silverriver

Faster version implemented in sglang https://lmsys.org/blog/2024-02-05-compressed-fsm/
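
The speedup there comes from what the post calls jump-forward decoding: whenever the compressed FSM permits exactly one next token, that token can be appended without running the model at all. A minimal sketch, reusing the hypothetical `fsm` interface from the snippet above:

```python
def jump_forward_decode(model, tokenizer, fsm, prompt, max_tokens=128):
    """Constrained decoding that skips the forward pass whenever the grammar
    forces a unique next token (e.g. fixed JSON keys and punctuation)."""
    tokens = tokenizer.encode(prompt)
    state = fsm.initial_state
    while len(tokens) < max_tokens and not fsm.is_final(state):
        allowed = fsm.allowed_tokens(state)
        if len(allowed) == 1:
            next_token = next(iter(allowed))  # forced token: no model call
        else:
            logits = model.next_token_logits(tokens)
            next_token = max(allowed, key=lambda t: logits[t])
        tokens.append(next_token)
        state = fsm.next_state(state, next_token)
    return tokenizer.decode(tokens)
```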

nivibilla avatar Feb 26 '24 04:02 nivibilla

> Faster version implemented in sglang https://lmsys.org/blog/2024-02-05-compressed-fsm/

Yep, the RadixAttention mechanism proposed in that paper is also a nice feature to have if we want to constrain the decoded sequence to a given JSON schema.

silverriver avatar Mar 02 '24 04:03 silverriver

Adding constrained decoding to this library (like SGLang's JSON decoding) would be great, as it would allow more reliable, faster generation. Is there any news about which release might include it?

fedem96 avatar May 17 '24 08:05 fedem96

We certainly need this functionality. With vLLM already supporting constrained decoding, its absence could be a dealbreaker for some TRT-LLM users. Is this on the roadmap by any chance? (Pinging @ncomly-nvidia in case you know.)

dhruvmullick avatar Jun 06 '24 15:06 dhruvmullick

Would this sample help?

mayani-nv avatar Jun 25 '24 00:06 mayani-nv

> Would this sample help?

Helpful, but as previously mentioned, TensorRT-LLM inference is done in C++, whereas that library is in Python.

Since the in-flight batcher used by the Triton Inference Server relies on the C++ implementation of TRT-LLM, that example cannot be used as smoothly without switching to the pure Python inference backend.

avianion avatar Jun 25 '24 00:06 avianion

cc @AdamzNV @ncomly-nvidia @laikhtewari for vis.

hello-11 avatar Nov 15 '24 11:11 hello-11

https://github.com/guidance-ai/llgtrt might be of interest. It is a native (albeit Rust) OpenAI-compatible REST server that incorporates the llguidance Rust library for constrained decoding.
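
Since it speaks the OpenAI protocol, a client could request schema-constrained output through the standard `response_format` field. A sketch, assuming a local llgtrt deployment (the endpoint, model name, and `json_schema` support are assumptions here, not verified against llgtrt's docs):

```python
from openai import OpenAI

# Hypothetical local llgtrt deployment; base_url and model are placeholders.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="llama",  # whichever model the server was started with
    messages=[{"role": "user", "content": "Name a city and its population."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(response.choices[0].message.content)  # output conforms to the schema
```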

mmoskal avatar Nov 15 '24 15:11 mmoskal

@silverriver We now have some support for constrained decoding; here is an example.
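
A sketch of what the usage looks like through the LLM API (assuming a recent TRT-LLM build with the xgrammar guided-decoding backend; exact parameter names may differ between versions):

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

# Load a model with guided decoding enabled (xgrammar backend).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          guided_decoding_backend="xgrammar")

# JSON schema the output must conform to.
schema = ('{"type": "object", '
          '"properties": {"answer": {"type": "string"}}, '
          '"required": ["answer"]}')

sampling = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))
outputs = llm.generate(["Answer in JSON: what is the capital of France?"],
                       sampling)
print(outputs[0].outputs[0].text)
```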

Superjomn avatar Mar 13 '25 00:03 Superjomn