How to calculate the number of LoRAs that can be cached in the host cache?
System Info
- GPU Name: NVIDIA A800
- TensorRT-LLM: 0.11.0
- NVIDIA Driver: 535.129.03
- OS: Ubuntu 22.04
- Triton Inference Server backend: tensorrtllm_backend
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Inference with LoRA: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm#running-lora-inference-with-inflight-batching
- Base model: qwen1.5-7b-chat
- LoRA rank is 8, so the size of the LoRA weights is (4 x 4096 x 2 x 8 + 3 x (4096 + 11008) x 8) x 32 x 2 bytes = 39976960 bytes ≈ 38.125 MB (see the size sketch after the first log below).
- Steps as below:
- I set the host cache parameter "lora_cache_host_memory_bytes" to 39976960, set "lora_cache_gpu_memory_fraction" to 0.1, and started the service. Log output:
[TensorRT-LLM][INFO] Using 39976960 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 1 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs
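For reference, a minimal sketch of the adapter-size arithmetic used above (the per-layer breakdown into attention and MLP projections is my assumption about how the formula is derived):

```python
# Reproduces the 38.125 MB number from the bullet above. The 4 + 3 split is my
# reading of the formula: 4 attention projections (4096 -> 4096) and 3 MLP
# projections (4096 <-> 11008) per layer, each stored as a rank-8 A/B pair in fp16.
hidden, inter, layers, rank, dtype_bytes = 4096, 11008, 32, 8, 2

attn_per_layer = 4 * (hidden * rank + rank * hidden)   # 4 x 4096 x 2 x 8
mlp_per_layer = 3 * (hidden + inter) * rank            # 3 x (4096 + 11008) x 8
total_bytes = (attn_per_layer + mlp_per_layer) * layers * dtype_bytes

print(total_bytes)           # 39976960
print(total_bytes / 2**20)   # 38.125 (MiB)
```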
- Sent a request with LoRA; an error occurred:
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Error storing task=1 in PEFT cache -- Cache is full. There are no done tasks to evict (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:243)
1 0x7f20d8f960a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x74c0a0) [0x7f20d8f960a0]
2 0x7f20dac724e0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 64
3 0x7f20daca3258 tensorrt_llm::executor::Executor::Impl::fetchNewRequests(int) + 2968
4 0x7f20daca4627 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455
- Changed "lora_cache_host_memory_bytes" to 104857600 (100 MB) and restarted the service. Log output:
[TensorRT-LLM][INFO] Using 104857600 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 3 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs
Theoretically, the host cache should only be able to hold 2 LoRAs: 100 MB // 38.125 MB = 2. But the log says 3, so I think the log is wrong.
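To make the mismatch explicit, this is the naive capacity arithmetic I am assuming; it is only my expectation, not necessarily how the PEFT cache actually accounts for its storage:

```python
# Naive expectation: host cache bytes divided by the adapter size.
computed_lora_bytes = 39_976_960       # rank-8 adapter size computed above
reported_max_lora_bytes = 19_988_480   # "Max LoRA size" printed by the server
host_cache_bytes = 104_857_600         # lora_cache_host_memory_bytes = 100 MB

print(host_cache_bytes // computed_lora_bytes)       # 2
print(host_cache_bytes // reported_max_lora_bytes)   # 5

# Neither 2 nor 5 matches the "can hold 3 max sized LoRAs" in the log; note
# also that the reported max size is exactly half of the computed adapter size.
```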
- I sent the first request with lora-1 and the second request with lora-2; both worked well, so I believe both LoRAs were cached in the host cache.
Then I sent a third request with lora-1, passing only lora_task_id (without weights and config), but an error occurred:
[TensorRT-LLM][WARNING] LoRA task 1 not found in cache. Please send LoRA weights with request
Then I sent a fourth request with lora-2, again with only lora_task_id, and it worked fine.
That is to say, lora-1 had been evicted. I want to know why.
- If I set "lora_cache_host_memory_bytes" to a larger value, step 3 (the lora_task_id-only request for lora-1) works fine.
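For completeness, a rough sketch of how the requests above are built (this is only an illustration: the model name "tensorrt_llm" and the tensor names/dtypes for input_ids, input_lengths, request_output_len, lora_task_id, lora_weights and lora_config are taken from the linked README and may not match every deployment):

```python
# Sketch of the client side (not the exact script I used): the first request
# per adapter ships lora_weights + lora_config; follow-up requests pass only
# lora_task_id and rely on the LoRA host cache. Model/tensor names below are
# assumptions based on the linked README.
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def _tensor(name, array):
    t = grpcclient.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    t.set_data_from_numpy(array)
    return t


def send_request(client, token_ids, output_len, task_id, weights=None, config=None):
    inputs = [
        _tensor("input_ids", np.array([token_ids], dtype=np.int32)),
        _tensor("input_lengths", np.array([[len(token_ids)]], dtype=np.int32)),
        _tensor("request_output_len", np.array([[output_len]], dtype=np.int32)),
        _tensor("lora_task_id", np.array([[task_id]], dtype=np.uint64)),
    ]
    if weights is not None and config is not None:
        # Only on the first request for this task id (or after eviction):
        # weights/config as numpy arrays shaped per the LoRA README.
        inputs.append(_tensor("lora_weights", weights.astype(np.float16)))
        inputs.append(_tensor("lora_config", config.astype(np.int32)))
    return client.infer("tensorrt_llm", inputs)


client = grpcclient.InferenceServerClient("localhost:8001")
# 1st/2nd requests: lora-1 / lora-2 with weights + config so they get cached.
# send_request(client, token_ids, 64, task_id=1, weights=lora1_w, config=lora1_cfg)
# send_request(client, token_ids, 64, task_id=2, weights=lora2_w, config=lora2_cfg)
# 3rd/4th requests: task id only -- works only while the adapter is still cached.
# send_request(client, token_ids, 64, task_id=1)   # this one hit the warning above
# send_request(client, token_ids, 64, task_id=2)
```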
Expected behavior
The number of LoRAs the host cache can hold equals lora_cache_host_memory_bytes // lora_size.
actual behavior
The number of LoRAs the host cache can hold does not equal lora_cache_host_memory_bytes // lora_size.
additional notes
If I set lora_cache_host_memory_bytes to 1 GB, I want to know exactly how many LoRAs can be cached.
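For example, with 1 GB the naive division gives the numbers below, but as the 100 MB log above shows, the runtime's own accounting does not follow this simple formula, which is exactly what I would like clarified:

```python
# Naive estimate only -- the real PEFT cache accounting appears to differ.
host_cache_bytes = 1 * 1024**3         # lora_cache_host_memory_bytes = 1 GiB
computed_lora_bytes = 39_976_960       # computed rank-8 adapter size
reported_max_lora_bytes = 19_988_480   # "Max LoRA size" from the server log

print(host_cache_bytes // computed_lora_bytes)       # 26
print(host_cache_bytes // reported_max_lora_bytes)   # 53
```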