Yuchao Zhang
Llama 3 should already be supported via the template https://github.com/npuichigo/openai_trtllm/blob/main/templates/history_template_llama3.liquid. To get the model, please refer to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#llama-v3-updates
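For orientation only, a rough sketch of what a Llama 3 history template in liquid can look like; the variable names (`messages`, `role`, `content`) are assumptions here, so check the linked history_template_llama3.liquid for the real template:

```
{% comment %}Sketch only; not the actual template from the repo.{% endcomment %}
<|begin_of_text|>{% for message in messages %}<|start_header_id|>{{ message.role }}<|end_header_id|>

{{ message.content }}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

```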
It's the `ensemble` model if the structure looks like https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.9.0/all_models/inflight_batcher_llm
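Roughly, the linked v0.9.0 layout contains these model directories (check your own model repository, names may differ slightly between releases):

```
all_models/inflight_batcher_llm/
├── ensemble/          # chains the three steps below into one callable model
├── preprocessing/     # tokenization (python backend)
├── tensorrt_llm/      # the TensorRT-LLM engine itself
├── postprocessing/    # de-tokenization (python backend)
└── tensorrt_llm_bls/  # BLS-based alternative to the ensemble
```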
It's not planned yet, but I think it's trivial to adapt the code for your use case.
Can you explain how to call the vLLM-based Triton backend? For example, the gRPC interface and the parameters needed to call the service.
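Not part of the original thread, but a minimal sketch of what such a gRPC call can look like, assuming the model was deployed with the stock triton-inference-server/vllm_backend tensor names (`text_input`, `stream`, `sampling_parameters` in; `text_output` out) under the hypothetical model name `vllm_model` on localhost:8001; adjust names and sampling parameters to your deployment:

```
import json
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Every response (or error) from the decoupled model lands here.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = [
    grpcclient.InferInput("text_input", [1], "BYTES"),
    grpcclient.InferInput("stream", [1], "BOOL"),
    grpcclient.InferInput("sampling_parameters", [1], "BYTES"),
]
inputs[0].set_data_from_numpy(np.array(["What is Triton Inference Server?"], dtype=object))
inputs[1].set_data_from_numpy(np.array([False], dtype=bool))  # False -> one final response
inputs[2].set_data_from_numpy(
    np.array([json.dumps({"temperature": 0.7, "max_tokens": 64})], dtype=object)
)

outputs = [grpcclient.InferRequestedOutput("text_output")]

# The vllm backend is decoupled, so use the streaming gRPC API even for one-shot calls.
client.start_stream(callback=callback)
client.async_stream_infer(model_name="vllm_model", inputs=inputs, outputs=outputs)
client.stop_stream()  # waits for outstanding responses before returning

while not responses.empty():
    item = responses.get()
    if isinstance(item, Exception):
        raise item
    print(item.as_numpy("text_output")[0].decode())
```

With `stream` set to True instead, the callback receives multiple partial responses as generation progresses.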
Could you set RUST_LOG to debug and attach the debug output here? https://github.com/npuichigo/openai_trtllm/blob/8e33ce19cac8b9803ce525b88475585d670fe01b/src/routes/chat.rs#L100
I tested with codellama and it indeed has no space between words.

```
$ python openai_completion.py
class SimpleTransformer(nn.Module):
    @classmethod
    def add_args(cls, parser):
        return parser

    @classmethod
    def from_args(cls, args):
$ python...
```
@charllll The inflight_batcher_llm_client only calls the `tensorrt_llm` model from Triton instead of the `ensemble` model and does [de-tokenization](https://github.com/triton-inference-server/tensorrtllm_backend/blob/da59830baf762a2026c10535ac6459d0cb45e990/inflight_batcher_llm/client/inflight_batcher_llm_client.py#L826) itself. It seems to be related to https://github.com/triton-inference-server/tensorrtllm_backend/issues/332
@Narsil when can we have a new version which includes the proxy setting?
Any update on this?