Aurelien Chartier


Please use the following example for draft-target speculative decoding with run.py: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/draft_target_model

In your example, the `draft_target_model_config` argument is missing.

The following input tensors need to be provided: `lookahead_window_size`, `lookahead_ngram_size`, and `lookahead_verification_set_size`. You can check the implementation of the `lookahead_config` in `inflight_batcher_llm/client/inflight_batcher_llm_client.py` for reference.
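For illustration, a minimal sketch of how those tensors could be attached to a request with the Triton Python gRPC client. The shapes, dtypes, example values, and the helper function are assumptions on my part; the referenced client script shows the exact layout.

```
# Sketch: adding lookahead decoding inputs to a Triton request.
# Assumptions: int32 tensors of shape [1, 1]; the example values are
# placeholders. Verify against
# inflight_batcher_llm/client/inflight_batcher_llm_client.py.
import numpy as np
import tritonclient.grpc as grpcclient


def int32_input(name: str, value: int) -> grpcclient.InferInput:
    data = np.array([[value]], dtype=np.int32)
    tensor = grpcclient.InferInput(name, list(data.shape), "INT32")
    tensor.set_data_from_numpy(data)
    return tensor


lookahead_inputs = [
    int32_input("lookahead_window_size", 4),
    int32_input("lookahead_ngram_size", 3),
    int32_input("lookahead_verification_set_size", 4),
]

# These would be appended to the usual inputs (input_ids,
# request_output_len, ...) before calling client.infer().
```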

We are working on adding support for multiple models in the Triton backend using MPI processes. A similar approach could be used to implement that support with a `GptManager` per process.

Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the `--multi-model` option.

Ideally yes. The TRT-LLM Triton backend does not check if there is an overlap, so it will let you deploy multiple models on a single GPU, but you'll need to...

Yes, see the link to the documentation in my April 16 message.

> What parameters can control the size of KV cache and forward inference GPU memory buffer?

Using the executor API, this is controlled by the `KvCacheConfig` class: https://github.com/NVIDIA/TensorRT-LLM/blob/548b5b73106aaf7374955e1c37aad677678ebc7b/cpp/include/tensorrt_llm/executor/executor.h#L859
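As a concrete illustration, here is a sketch using the Python bindings of the executor API. I am assuming the keyword arguments mirror the C++ `KvCacheConfig` fields linked above; the values are placeholders.

```
# Sketch: limiting KV-cache memory via the executor API's Python bindings.
# Assumptions: kwargs mirror the C++ KvCacheConfig fields (enable_block_reuse,
# free_gpu_memory_fraction, max_tokens); the values are placeholders.
import tensorrt_llm.bindings.executor as trtllm

kv_cache_config = trtllm.KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.85,  # cap the KV cache at ~85% of free GPU memory
    # max_tokens=16384,             # alternatively, cap by total KV-cache tokens
)

executor_config = trtllm.ExecutorConfig(
    max_beam_width=1,
    kv_cache_config=kv_cache_config,
)
```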

Could you try with the following option to `build_wheel.py`?

```
--extra-cmake-vars ENABLE_MULTI_DEVICE=0
```

Fair enough. If you are building for a specific target architecture, `-a native` can significantly reduce build time.