bhsueh_NV


@robosina It is not supported yet; that does not mean it cannot be supported.

Could you try loading the model with `transformers` first? It looks like the issue happens on the `transformers` side rather than on the tensorrt_llm side.
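
A minimal sketch of such a sanity check, assuming a causal-LM checkpoint (the path is a placeholder for the checkpoint that fails in tensorrt_llm):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"  # placeholder: the checkpoint that fails in tensorrt_llm

# If either call raises, the problem is reproducible in transformers alone.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
print(model.config)
```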

> I tried loading the model with transformers as you suggested, but it still gives the same error. I even tried different versions of transformers just to cross-check if it's a...

This is caused by a version mismatch between the trtllm used to build the engine and the trtllm used at runtime. Please check that both trtllm versions match.
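
One way to check, run in both the engine-building and the serving environments (a sketch using the standard package version attribute):

```python
import tensorrt_llm

# Print the installed version; it must be identical in the environment
# that built the engine and the environment that runs it.
print(tensorrt_llm.__version__)
```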

Please share the full reproduction steps, including how you build the docker image, how you build the engine, how you launch the server, and how you send requests.

The kv cache size is controlled by `max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction`, described [here](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#modify-the-model-configuration). Please try setting them to proper values.
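
In the Triton backend, both knobs are string parameters in the `tensorrt_llm` model's `config.pbtxt`; a minimal sketch, with the values chosen only for illustration:

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.5"
  }
}
```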

The GPU memory utilization is near 100% because the kv cache manager allocates 90% of the free memory for the kv cache by default. If you don't want to spend so much memory on the kv cache,...
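
You can lower `kv_cache_free_gpu_mem_fraction` as above. If you drive TensorRT-LLM from Python instead of Triton, recent releases expose the same knob through the LLM API; a sketch, assuming a tensorrt_llm version that ships `KvCacheConfig` (the 0.3 fraction and model path are illustrative):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the kv cache at 30% of free GPU memory instead of the ~90% default.
llm = LLM(
    model="/path/to/model",  # placeholder checkpoint path
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.3),
)
```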