
Guides or tips for optimizing KV cache usage with the inflight batcher

pfldy2850 opened this issue on Jan 31, 2024 · 6 comments

Hello, tensorrt-llm team,

I have been testing the performance of the combination of int8_kv_cache + weight_only (int8) on the llama-2-7b model (with TensorRT-LLM release v0.7.1).

The node has two T4 GPUs, shown below with nvidia-smi.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:3B:00.0 Off |                    0 |
| N/A   63C    P0              70W /  70W |  14402MiB / 15360MiB |     95%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:D8:00.0 Off |                    0 |
| N/A   69C    P0              70W /  70W |  14400MiB / 15360MiB |     95%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I've been checking the logs printed during the run, and the number of Used KV cache blocks has stopped increasing regardless of the Active Request Count.

{
  "Active Request Count": 64,
  "Context Requests": 0,
  "Free KV cache blocks": 294,
  "Generation Requests": 8,
  "Iteration Counter": 19539,
  "Max KV cache blocks": 390,
  "Max Request Count": 128,
  "MicroBatch ID": 1,
  "Runtime CPU Memory Usage": 704,
  "Runtime GPU Memory Usage": 134836176,
  "Runtime Pinned Memory Usage": 0,
  "Scheduled Requests": 8,
  "Timestamp": "01-31-2024 02:56:23",
  "Tokens per KV cache block": 128,
  "Total Context Tokens": 0,
  "Used KV cache blocks": 96
}

Also, I haven't found any documentation that describes the settings for handling in-flight batching on the model server. Can you give me any guides or tips for optimizing the model server?

pfldy2850 · Jan 31, 2024
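
For context on what "settings for handling in-flight batching" usually means here: with the Triton tensorrt_llm backend, the in-flight batcher is switched on through the tensorrt_llm model's config.pbtxt. A minimal sketch of the relevant entries, assuming the tensorrtllm_backend layout from around v0.7.x (the engine path is a placeholder, and key names can differ between releases):

# Hypothetical excerpt from triton_model_repo/tensorrt_llm/config.pbtxt
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }  # use the in-flight fused batcher instead of static "V1" batching
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/path/to/llama-2-7b/trt_engines/1-gpu" }  # placeholder path to the built engine directory
}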

Hello, I was wondering whether you used inflight_batching? It seems to me that Triton's inference time is very close to the time required when using TRT-LLM directly. Here is some of my log info; I'm not sure what the difference is.

{ "Active Request Count": 1, "Context Requests": 1, "Free KV cache blocks": 1708, "Generation Requests": 0, "Iteration Counter": 0, "Max KV cache blocks": 1709, "Max Request Count": 2, "MicroBatch ID": 0, "Runtime CPU Memory Usage": 124, "Runtime GPU Memory Usage": 1814208, "Runtime Pinned Memory Usage": 16, "Scheduled Requests": 1, "Timestamp": "02-02-2024 01:31:07", "Tokens per KV cache block": 128, "Total Context Tokens": 19, "Used KV cache blocks": 1 }

python3 build.py --model_dir=/tensorrtllm_backend13/gpt2_medium_hf/1-gpu/ \
  --n_layer=96 \
  --n_embd=12288 \
  --n_head=96 \
  --max_batch_size=2 \
  --dtype float16 \
  --remove_input_padding \
  --enable_context_fmha \
  --use_gemm_plugin float16 \
  --use_gpt_attention_plugin float16 \
  --output_dir=/tensorrtllm_backend13/gpt2_trt_batch2/

lyc728 · Feb 2, 2024

@pfldy2850 Could you share your build.py command? Are you using the Triton tensorrt_llm backend? If so, could you also share the config.pbtxt for the tensorrt_llm model?

You should have a look at the following documentation page: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md which discusses some of the parameters that can be tuned to get optimal performance.

pcastonguay · Feb 2, 2024

@pcastonguay

Thanks for pointing me to the best-practices documentation. I think my question was related to the Maximum Attention Window Size: adjusting max_input_len and max_output_len at build time increased the KV cache block usage.

I may be wrong, but I'm guessing there's a way to schedule this appropriately at runtime, with max_tokens or something, rather than pre-planning it at build time. If you know anything about this, I'd be very grateful if you could let me know.

pfldy2850 · Feb 3, 2024
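
On steering KV cache usage at runtime rather than only at engine-build time: the backend config does expose runtime-side sizing knobs. A hedged sketch, with the same config.pbtxt assumptions as above (the numbers are illustrative, not recommendations):

# Hypothetical runtime KV cache sizing entries in config.pbtxt
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "10240" }  # illustrative hard cap on tokens held in the paged KV cache
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.9" }  # illustrative fraction of free GPU memory reserved for the KV cache
}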

@lyc728

Yeah, I used inflight_batching. Since the Active Request Count is 1 in your log, it looks like in-flight batching is not being applied properly.

As far as I know, the Active Request Count is related to max_num_sequence in tensorrt_llm's config.pbtxt. Try setting it to a value greater than 1.

pfldy2850 · Feb 3, 2024
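
For concreteness, the setting referred to above would look roughly like this in the tensorrt_llm model's config.pbtxt, assuming the parameter is named max_num_sequences in the backend version in use (the value 64 is just an example):

parameters: {
  key: "max_num_sequences"
  value: { string_value: "64" }  # example: allow up to 64 sequences to be batched in flight
}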

When using the GUARANTEED_NO_EVICT scheduling policy (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md#batch-scheduler-policy), the scheduler will only schedule a request if the KV cache has enough free blocks to drive that request to completion (it assumes the worst case, where max_output_len tokens will be generated). You can try MAX_UTILIZATION instead, which will schedule as many requests as possible at every iteration, but it can cause requests to be paused later if the KV cache is full.

You can also try setting max_attention_window_size to a value smaller than max_input_len + max_output_len to reduce pressure on the KV cache.

pcastonguay · Feb 5, 2024
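
Expressed as config.pbtxt entries, the two suggestions above would look roughly like this (same assumptions about the backend layout as in the earlier sketches; the window size is a placeholder to be chosen below max_input_len + max_output_len):

parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "max_utilization" }  # schedule aggressively; requests may be paused if the KV cache fills up
}
parameters: {
  key: "max_attention_window_size"
  value: { string_value: "2048" }  # placeholder; limits attention to the last 2048 tokens, reducing KV cache pressure
}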

@pcastonguay

Thanks for your reply. I will take your advice and proceed with the optimization.

pfldy2850 · Feb 15, 2024