How to calculate the number of LoRAs that can be cached in the host cache?
System Info
- GPU Name: NVIDIA A800
- TensorRT-LLM: 0.11.0
- NVIDIA Driver: 535.129.03
- OS: Ubuntu 22.04
- Triton Inference Server backend: tensorrtllm_backend
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Inference with LoRA: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm#running-lora-inference-with-inflight-batching
- Base model: qwen1.5-7b-chat
- LoRA rank is 8, so the size of the LoRA weights is (4 x 4096 x 2 x 8 + 3 x (4096 + 11008) x 8) x 32 x 2 bytes = 39976960 bytes ≈ 38.125 MB (see the size sketch after the first log below).
- Steps as below:
- I set the host cache parameter "lora_cache_host_memory_bytes" to 39976960, set "lora_cache_gpu_memory_fraction" to 0.1, and started the service. Log output:
[TensorRT-LLM][INFO] Using 39976960 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 1 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs
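For reference, a minimal sketch of the adapter-size arithmetic used above (the per-layer breakdown into attention and MLP projections is my assumption about how the formula is derived):

```python
# Reproduces the 38.125 MB number from the bullet above. The 4 + 3 split is my
# reading of the formula: 4 attention projections (4096 -> 4096) and 3 MLP
# projections (4096 <-> 11008) per layer, each stored as a rank-8 A/B pair in fp16.
hidden, inter, layers, rank, dtype_bytes = 4096, 11008, 32, 8, 2

attn_per_layer = 4 * (hidden * rank + rank * hidden)   # 4 x 4096 x 2 x 8
mlp_per_layer = 3 * (hidden + inter) * rank            # 3 x (4096 + 11008) x 8
total_bytes = (attn_per_layer + mlp_per_layer) * layers * dtype_bytes

print(total_bytes)           # 39976960
print(total_bytes / 2**20)   # 38.125 (MiB)
```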
- Sent a request with LoRA; an error occurred:
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Error storing task=1 in PEFT cache -- Cache is full. There are no done tasks to evict (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:243)
1 0x7f20d8f960a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x74c0a0) [0x7f20d8f960a0]
2 0x7f20dac724e0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 64
3 0x7f20daca3258 tensorrt_llm::executor::Executor::Impl::fetchNewRequests(int) + 2968
4 0x7f20daca4627 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455
- Changed "lora_cache_host_memory_bytes" to 104857600 (100 MB) and restarted the service. Log output:
[TensorRT-LLM][INFO] Using 104857600 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 3 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs
Theoretically, the host cache should only be able to hold 2 LoRAs: 100 MB // 38.125 MB = 2. But the log says 3, so I think the log is wrong.
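To make the mismatch explicit, this is the naive capacity arithmetic I am assuming; it is only my expectation, not necessarily how the PEFT cache actually accounts for its storage:

```python
# Naive expectation: host cache bytes divided by the adapter size.
computed_lora_bytes = 39_976_960       # rank-8 adapter size computed above
reported_max_lora_bytes = 19_988_480   # "Max LoRA size" printed by the server
host_cache_bytes = 104_857_600         # lora_cache_host_memory_bytes = 100 MB

print(host_cache_bytes // computed_lora_bytes)       # 2
print(host_cache_bytes // reported_max_lora_bytes)   # 5

# Neither 2 nor 5 matches the "can hold 3 max sized LoRAs" in the log; note
# also that the reported max size is exactly half of the computed adapter size.
```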
- I sent the first request with lora-1 and the second request with lora-2; both worked well, so I believe both LoRAs were cached in the host cache.
Then I sent a third request with lora-1, passing only lora_task_id (without weights and config), but an error occurred:
[TensorRT-LLM][WARNING] LoRA task 1 not found in cache. Please send LoRA weights with request
Then I sent a fourth request with lora-2, again with only lora_task_id, and it worked fine.
That is to say, lora-1 had been evicted. I want to know why.
- If I set "lora_cache_host_memory_bytes" to a larger value, step 3 (the lora_task_id-only request for lora-1) works fine.
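For completeness, a rough sketch of how the requests above are built (this is only an illustration: the model name "tensorrt_llm" and the tensor names/dtypes for input_ids, input_lengths, request_output_len, lora_task_id, lora_weights and lora_config are taken from the linked README and may not match every deployment):

```python
# Sketch of the client side (not the exact script I used): the first request
# per adapter ships lora_weights + lora_config; follow-up requests pass only
# lora_task_id and rely on the LoRA host cache. Model/tensor names below are
# assumptions based on the linked README.
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def _tensor(name, array):
    t = grpcclient.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    t.set_data_from_numpy(array)
    return t


def send_request(client, token_ids, output_len, task_id, weights=None, config=None):
    inputs = [
        _tensor("input_ids", np.array([token_ids], dtype=np.int32)),
        _tensor("input_lengths", np.array([[len(token_ids)]], dtype=np.int32)),
        _tensor("request_output_len", np.array([[output_len]], dtype=np.int32)),
        _tensor("lora_task_id", np.array([[task_id]], dtype=np.uint64)),
    ]
    if weights is not None and config is not None:
        # Only on the first request for this task id (or after eviction):
        # weights/config as numpy arrays shaped per the LoRA README.
        inputs.append(_tensor("lora_weights", weights.astype(np.float16)))
        inputs.append(_tensor("lora_config", config.astype(np.int32)))
    return client.infer("tensorrt_llm", inputs)


client = grpcclient.InferenceServerClient("localhost:8001")
# 1st/2nd requests: lora-1 / lora-2 with weights + config so they get cached.
# send_request(client, token_ids, 64, task_id=1, weights=lora1_w, config=lora1_cfg)
# send_request(client, token_ids, 64, task_id=2, weights=lora2_w, config=lora2_cfg)
# 3rd/4th requests: task id only -- works only while the adapter is still cached.
# send_request(client, token_ids, 64, task_id=1)   # this one hit the warning above
# send_request(client, token_ids, 64, task_id=2)
```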
Expected behavior
The number of LoRAs the host cache can hold equals lora_cache_host_memory_bytes // lora_size.
actual behavior
The number of LoRAs the host cache can hold does not equal lora_cache_host_memory_bytes // lora_size.
additional notes
If I set lora_cache_host_memory_bytes to 1 GB, I want to know exactly how many LoRAs can be cached.
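For example, with 1 GB the naive division gives the numbers below, but as the 100 MB log above shows, the runtime's own accounting does not follow this simple formula, which is exactly what I would like clarified:

```python
# Naive estimate only -- the real PEFT cache accounting appears to differ.
host_cache_bytes = 1 * 1024**3         # lora_cache_host_memory_bytes = 1 GiB
computed_lora_bytes = 39_976_960       # computed rank-8 adapter size
reported_max_lora_bytes = 19_988_480   # "Max LoRA size" from the server log

print(host_cache_bytes // computed_lora_bytes)       # 26
print(host_cache_bytes // reported_max_lora_bytes)   # 53
```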