
How to calculate the number of cached loras

limertang opened this issue 1 year ago · 0 comments

System Info

GPU Name: NVIDIA A800
TensorRT-LLM: 0.11.0
Nvidia Driver: 535.129.03
OS: Ubuntu 22.04
Triton Inference Server backend: tensorrtllm_backend

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  • Inference with LoRA: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm#running-lora-inference-with-inflight-batching

  • base model: qwen1.5-7b-chat

  • The LoRA rank is 8, so the size of the LoRA weights is (4x4096x2x8 + 3x(4096+11008)x8) x 32 x 2 bytes = 39,976,960 bytes ≈ 38.125 MB (see the sketch below).
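
    For reference, a minimal sketch of that calculation, assuming the Qwen1.5-7B shapes (hidden size 4096, intermediate size 11008, 32 layers), LoRA applied to the 4 attention projections and 3 MLP projections, and fp16 (2-byte) weights:

```python
# Sketch: estimate the LoRA weight size for qwen1.5-7b-chat at rank 8.
# Assumes LoRA on q/k/v/o (4 modules of 4096x4096) and gate/up/down
# MLP projections (3 modules of 4096x11008), fp16 weights.
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
rank = 8
bytes_per_elem = 2  # fp16

# Each adapted module stores two low-rank matrices, A (in x r) and B (r x out).
attn_elems = 4 * (hidden_size * rank + rank * hidden_size)        # 4 x 4096 x 2 x 8
mlp_elems = 3 * (hidden_size * rank + rank * intermediate_size)   # 3 x (4096 + 11008) x 8

elems_per_layer = attn_elems + mlp_elems
total_elems = elems_per_layer * num_layers        # 19,988,480 elements
total_bytes = total_elems * bytes_per_elem        # 39,976,960 bytes

print(total_bytes, total_bytes / 2**20)           # 39976960  38.125 (MB)
```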

  • steps as below:

  • I set the host cache parameter "lora_cache_host_memory_bytes" to 39976960, set "lora_cache_gpu_memory_fraction" to 0.1, and started the service. Log output:

[TensorRT-LLM][INFO] Using 39976960 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 1 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs

  • I sent a request with LoRA and an error occurred:

[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Error storing task=1 in PEFT cache -- Cache is full. There are no done tasks to evict (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:243)
1 0x7f20d8f960a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x74c0a0) [0x7f20d8f960a0]
2 0x7f20dac724e0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 64
3 0x7f20daca3258 tensorrt_llm::executor::Executor::Impl::fetchNewRequests(int) + 2968
4 0x7f20daca4627 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455

  • I changed "lora_cache_host_memory_bytes" to 104857600 (100 MB) and restarted the service. Log output:

[TensorRT-LLM][INFO] Using 104857600 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 3 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs

Theoretically, the host cache should only be able to hold 2 LoRAs (100 MB // 38.125 MB = 2), but the log says 3, so I think the log is wrong.
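
As a hedged aside, the counts reported in the two logs happen to match a ceiling rather than a floor when dividing the cache size by the full LoRA size. The sketch below only redoes the arithmetic from the logged values; it is a guess from the numbers, not based on the actual PeftCacheManager implementation:

```python
import math

# Values taken from the log output above.
host_cache_bytes_1 = 39976960
host_cache_bytes_2 = 104857600
device_cache_bytes = 312836096
lora_bytes = 39976960            # one full rank-8 LoRA for qwen1.5-7b, fp16

# Naive expectation from the issue: floor division.
print(host_cache_bytes_2 // lora_bytes)            # 2  <- what the reporter expects

# What would reproduce the logged counts: ceiling division (just a guess).
print(math.ceil(host_cache_bytes_1 / lora_bytes))  # 1  ("host Cache can hold 1")
print(math.ceil(host_cache_bytes_2 / lora_bytes))  # 3  ("host Cache can hold 3")
print(math.ceil(device_cache_bytes / lora_bytes))  # 8  ("device Cache can hold 8")
```

If that guess is right, only 2 full copies actually fit in the 100 MB host cache, in line with the reporter's own estimate. The "Max LoRA size is 19988480" line also equals the element count (19,988,480 fp16 elements = 39,976,960 bytes), so that figure may be in elements rather than bytes; both points are speculation and not confirmed against the code.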

  • I sent the first request with lora-1 and the second request with lora-2; both requests worked well, so I assume both LoRAs were cached in the host cache.

    Then I sent a third request with lora-1, providing only lora_task_id (no weights or config), but an error occurred:

[TensorRT-LLM][WARNING] LoRA task 1 not found in cache. Please send LoRA weights with request

Then I sent a 4th request with lora-2, again with only lora_task_id, and it worked fine.
That is to say, lora-1 had been evicted. I want to know why.

  • If I set "lora_cache_host_memory_bytes" to a larger value, step 3 works fine.

Expected behavior

The number of LoRAs the host cache can hold equals lora_cache_host_memory_bytes // lora_size.

Actual behavior

The number of LoRAs the host cache can hold does not equal lora_cache_host_memory_bytes // lora_size.

Additional notes

If I set lora_cache_host_memory_bytes to 1 GB, I want to know exactly how many LoRAs can be cached.
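
For what it's worth, the naive floor-division estimate for a 1 GiB host cache with this model and rank would be the following; this assumes no per-entry overhead or page alignment, which may not hold in practice:

```python
host_cache_bytes = 1 * 1024**3   # 1 GiB
lora_bytes = 39976960            # rank-8 LoRA for qwen1.5-7b, fp16 (see above)
print(host_cache_bytes // lora_bytes)   # 26
```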

limertang · Aug 08 '24 07:08