bhsueh_NV


@robosina It is not supported yet; that does not mean it cannot be supported.

Could you try loading the model with `transformers` first? It looks like the issue happens on the `transformers` side rather than on the tensorrt_llm side.
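
A minimal sketch of such a sanity check, assuming a causal-LM checkpoint (the path is a placeholder for the checkpoint that fails in tensorrt_llm):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"  # placeholder: the checkpoint that fails in tensorrt_llm

# If either call raises, the problem is reproducible in transformers alone.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
print(model.config)
```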

> I tried loading the model with transformers as you suggested, but it still gives the same error. I even tried different versions of transformers just to cross-check if it's a...

This is caused by a version mismatch between the trtllm used to build the engine and the trtllm used at runtime. Please check that both trtllm versions match.
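
One way to check, run in both the engine-building and the serving environments (a sketch using the standard package version attribute):

```python
import tensorrt_llm

# Print the installed version; it must be identical in the environment
# that built the engine and the environment that runs it.
print(tensorrt_llm.__version__)
```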

Please share the full reproduction steps, including how you build the docker image, how you build the engine, how you launch the server, and how you send requests.

The kv cache size is controlled by `max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction`, described [here](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#modify-the-model-configuration). Please try setting them to proper values.
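
In the Triton backend, both knobs are string parameters in the `tensorrt_llm` model's `config.pbtxt`; a minimal sketch, with the values chosen only for illustration:

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.5"
  }
}
```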

The GPU memory utilization is near 100% because the kv cache manager allocates 90% of the free memory for the kv cache by default. If you don't want to spend so much memory on the kv cache,...
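
You can lower `kv_cache_free_gpu_mem_fraction` as above. If you drive TensorRT-LLM from Python instead of Triton, recent releases expose the same knob through the LLM API; a sketch, assuming a tensorrt_llm version that ships `KvCacheConfig` (the 0.3 fraction and model path are illustrative):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the kv cache at 30% of free GPU memory instead of the ~90% default.
llm = LLM(
    model="/path/to/model",  # placeholder checkpoint path
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.3),
)
```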