Guides or tips for optimizing KV cache usage with the inflight batcher
Hello, tensorrt-llm team,
I have been testing the performance of the combination of int8_kv_cache + weight_only (int8) on the llama-2-7b model (tested with TensorRT-LLM release v0.7.1).
The node has two T4 GPUs, as shown in the nvidia-smi output below.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 63C P0 70W / 70W | 14402MiB / 15360MiB | 95% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:D8:00.0 Off | 0 |
| N/A 69C P0 70W / 70W | 14400MiB / 15360MiB | 95% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
I've been checking the logs printed while serving, and the number of Used KV cache blocks has stopped increasing regardless of the Active Request Count.
{
"Active Request Count": 64,
"Context Requests": 0,
"Free KV cache blocks": 294,
"Generation Requests": 8,
"Iteration Counter": 19539,
"Max KV cache blocks": 390,
"Max Request Count": 128,
"MicroBatch ID": 1,
"Runtime CPU Memory Usage": 704,
"Runtime GPU Memory Usage": 134836176,
"Runtime Pinned Memory Usage": 0,
"Scheduled Requests": 8,
"Timestamp": "01-31-2024 02:56:23",
"Tokens per KV cache block": 128,
"Total Context Tokens": 0,
"Used KV cache blocks": 96
}
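(If I'm reading this correctly, with 128 tokens per KV cache block the 390 max blocks correspond to 390 × 128 = 49,920 cacheable tokens, yet only 96 blocks, i.e. 96 × 128 = 12,288 tokens, stay in use even with 64 active requests.)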
Also, I haven't found any documentation describing the settings for handling in-flight batching on the model server. Can you give me any guides or tips for optimizing the model server?
Hello, I was wondering whether you used inflight_batching? It seems to me that Triton's inference time is very close to the time required when using trt-llm directly. Here is some of my log info; I'm not sure what the difference is.
{ "Active Request Count": 1, "Context Requests": 1, "Free KV cache blocks": 1708, "Generation Requests": 0, "Iteration Counter": 0, "Max KV cache blocks": 1709, "Max Request Count": 2, "MicroBatch ID": 0, "Runtime CPU Memory Usage": 124, "Runtime GPU Memory Usage": 1814208, "Runtime Pinned Memory Usage": 16, "Scheduled Requests": 1, "Timestamp": "02-02-2024 01:31:07", "Tokens per KV cache block": 128, "Total Context Tokens": 19, "Used KV cache blocks": 1 }
python3 build.py --model_dir=/tensorrtllm_backend13/gpt2_medium_hf/1-gpu/ \
    --n_layer=96 \
    --n_embd=12288 \
    --n_head=96 \
    --max_batch_size=2 \
    --dtype float16 \
    --remove_input_padding \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --output_dir=/tensorrtllm_backend13/gpt2_trt_batch2/
@pfldy2850 could you share your build.py command? Are you using the Triton tensorrt_llm backend? If so, could you also share the config.pbtxt for the tensorrt_llm model?
You should have a look at the following documentation page: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md which discusses some of the parameters that can be tuned to get optimal performance.
@pcastonguay
Thanks for pointing me to the documentation about the best practices.
I think my question was related to the Maximum Attention Window Size.
Adjusting max_input_len and max_output_len at build time increased the KV cache block usage.
I'm not sure, but I'm guessing there's a way to schedule these appropriately at runtime with max_tokens or something similar, rather than planning them in advance at build time. If you know anything about this, I'd be very grateful if you could let me know.
@lyc728
Yeah, I used inflight_batching. Since the active request count is 1 in your log, it looks like inflight batching is not being applied properly.
As far as I know, the active request count is related to max_num_sequence in the tensorrt_llm model's config.pbtxt. Try setting that parameter to a value greater than 1.
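Roughly like this in the tensorrt_llm model's config.pbtxt (just a minimal sketch written from memory; the gpt_model_type key, the "inflight_fused_batching" value, and the example value 64 are my assumptions, so please double-check the exact key names against the tensorrtllm_backend documentation):

parameters: {
  key: "gpt_model_type"
  value: {
    # enables in-flight (continuous) batching in the Triton backend
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "max_num_sequence"
  value: {
    # example value; allows more than one request to be active at once
    string_value: "64"
  }
}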
When using the GUARANTEED_NO_EVICT scheduling policy (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md#batch-scheduler-policy), the scheduler will only schedule a request if the KV cache has enough blocks to drive that request to completion (it assumes the worst case, where max_output_len tokens are generated). You can try MAX_UTILIZATION instead, which schedules as many requests as possible at every iteration, but it can cause requests to be paused later if the KV cache fills up.
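To make that concrete with illustrative numbers (not taken from your setup): with 128 tokens per KV cache block, a request with a 512-token prompt and max_output_len = 1024 needs ceil((512 + 1024) / 128) = 12 free blocks before GUARANTEED_NO_EVICT will schedule it, so a pool of 390 blocks like the one in your log can keep only about 32 such requests in flight at once.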
You can also try setting max_attention_window_size to a value smaller than max_input_len + max_output_len to reduce pressure on the KV cache.
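Both of these are exposed as parameters in the tensorrt_llm model's config.pbtxt. A minimal sketch (key names and example values written from memory, so please verify them against the tensorrtllm_backend config reference):

parameters: {
  key: "batch_scheduler_policy"
  value: {
    # "guaranteed_no_evict" (the policy described above) or "max_utilization"
    string_value: "max_utilization"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    # example value; smaller than max_input_len + max_output_len to bound KV cache usage per request
    string_value: "2048"
  }
}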
@pcastonguay
Thanks for your reply. I will take your advice and proceed with the optimization.