tensorrtllm_backend
lora_cache_gpu_memory_fraction is not a good parameter
I want to run the tensorrt_llm program on a server, and I want execution to be independent of the GPU environment: the GPU model and the amount of free GPU memory. However, the lora_cache_gpu_memory_fraction parameter inspects the available GPU memory and allocates a percentage of it for the LoRA cache, so the program's behavior depends on the GPU model and on how much memory happens to be free at launch. Please, if possible, add an alternative parameter that lets us specify a fixed amount, such as 1 GB, to be allocated for LoRA. That way the memory allocation would always be constant and independent of the execution environment.
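Until such a parameter exists, a possible launch-time workaround is to compute the fraction yourself from a fixed byte budget and the currently free GPU memory, then pass the result as lora_cache_gpu_memory_fraction. This is only a sketch under my own assumptions (it queries free memory via nvidia-smi; the helper names are hypothetical, not part of tensorrtllm_backend):

```python
import subprocess


def fraction_for_fixed_budget(budget_bytes: int, free_bytes: int) -> float:
    """Return the fraction of free GPU memory equal to a fixed byte budget."""
    if budget_bytes > free_bytes:
        raise ValueError("requested LoRA cache budget exceeds free GPU memory")
    return budget_bytes / free_bytes


def free_gpu_memory_bytes(device_index: int = 0) -> int:
    """Query free memory on one GPU via nvidia-smi (reported in MiB)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits", "-i", str(device_index)],
        text=True,
    )
    return int(out.strip()) * 1024 * 1024


# Example (no GPU needed): a 1 GiB budget with 4 GiB free gives 0.25.
# At deploy time you would instead call free_gpu_memory_bytes() and pass
# the resulting fraction as lora_cache_gpu_memory_fraction.
example_fraction = fraction_for_fixed_budget(1 << 30, 4 << 30)
```

This keeps the effective LoRA cache size constant across GPUs, but it only approximates a fixed allocation: if other processes allocate memory between the query and server startup, the fraction is computed against stale numbers, which is exactly why a true fixed-size parameter would be preferable.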