Aurelien Chartier


Please use the following example for draft-target speculative decoding with run.py: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/draft_target_model

In your example, the `draft_target_model_config` argument is missing.

The following input tensors need to be provided: `lookahead_window_size`, `lookahead_ngram_size`, and `lookahead_verification_set_size`. You can check the implementation of the `lookahead_config` in `inflight_batcher_llm/client/inflight_batcher_llm_client.py` for reference.
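For illustration, a minimal sketch of how those tensors could be attached to a request with the Triton Python gRPC client. The shapes, dtypes, example values, and the helper function are assumptions on my part; the referenced client script shows the exact layout.

```
# Sketch: adding lookahead decoding inputs to a Triton request.
# Assumptions: int32 tensors of shape [1, 1]; the example values are
# placeholders. Verify against
# inflight_batcher_llm/client/inflight_batcher_llm_client.py.
import numpy as np
import tritonclient.grpc as grpcclient


def int32_input(name: str, value: int) -> grpcclient.InferInput:
    data = np.array([[value]], dtype=np.int32)
    tensor = grpcclient.InferInput(name, list(data.shape), "INT32")
    tensor.set_data_from_numpy(data)
    return tensor


lookahead_inputs = [
    int32_input("lookahead_window_size", 4),
    int32_input("lookahead_ngram_size", 3),
    int32_input("lookahead_verification_set_size", 4),
]

# These would be appended to the usual inputs (input_ids,
# request_output_len, ...) before calling client.infer().
```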

We are working on adding support for multiple models in the Triton backend using MPI processes. A similar approach could be used to implement that support with a `GptManager` per process.

Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the `--multi-model` option.

Ideally yes. The TRT-LLM Triton backend does not check if there is an overlap, so it will let you deploy multiple models on a single GPU, but you'll need to...

Yes, see the link to the documentation in my April 16 message.

> What parameters can control the size of KV cache and forward inference GPU memory buffer?

Using the executor API, this is controlled by the `KvCacheConfig` class: https://github.com/NVIDIA/TensorRT-LLM/blob/548b5b73106aaf7374955e1c37aad677678ebc7b/cpp/include/tensorrt_llm/executor/executor.h#L859
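As a concrete illustration, here is a sketch using the Python bindings of the executor API. I am assuming the keyword arguments mirror the C++ `KvCacheConfig` fields linked above; the values are placeholders.

```
# Sketch: limiting KV-cache memory via the executor API's Python bindings.
# Assumptions: kwargs mirror the C++ KvCacheConfig fields (enable_block_reuse,
# free_gpu_memory_fraction, max_tokens); the values are placeholders.
import tensorrt_llm.bindings.executor as trtllm

kv_cache_config = trtllm.KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.85,  # cap the KV cache at ~85% of free GPU memory
    # max_tokens=16384,             # alternatively, cap by total KV-cache tokens
)

executor_config = trtllm.ExecutorConfig(
    max_beam_width=1,
    kv_cache_config=kv_cache_config,
)
```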

Could you try with the following option to `build_wheel.py`?

```
--extra-cmake-vars ENABLE_MULTI_DEVICE=0
```

Fair enough. If you are building for a specific target architecture, `-a native` can significantly reduce build time.