
Performance Issue when using tools/llm

Open ChiikawaSama opened this issue 3 months ago • 3 comments

❓ Question

What you have already tried

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • PyTorch Version (e.g., 1.0): 2.8.0
  • CPU Architecture: amd
  • OS (e.g., Linux): ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): pip
  • Build command you used (if compiling from source): NO
  • Are you using local sources or building from archives: NO
  • Python version: 3.10
  • CUDA version: 12.8
  • GPU models and configuration: NVIDIA
  • Any other relevant information: directly used the torch-tensorrt 2.8.0 wheel with the GitHub 2.8.0 tag to run tools/llm

Additional context

Hi there, I tried to use tools/llm with static_cache_v2 to run the Qwen2.5 model, using the following command:

python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark

When I profiled with Nsight Systems, I found that enabling static_cache_v2 adds launch overhead to the TensorRT engine in every prefill/decode block. Do you see this problem too? I think this overhead is too large; it makes torch-tensorrt almost the same speed as just enabling torch.compile.
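For reference, this is roughly how I would corroborate the gap with plain CPU-side timing outside of Nsight Systems (a minimal sketch; model and input_signature are stand-ins for whatever run_llm.py actually passes into the compiled engine, not the real variable names):

import time
import torch

def time_engine_steps(model, input_signature, iters=32):
    # model / input_signature are hypothetical stand-ins for the objects
    # run_llm.py passes into the compiled Torch-TensorRT module.
    times_ms = []
    for _ in range(iters):
        torch.cuda.synchronize()        # finish any pending GPU work first
        t0 = time.perf_counter()
        _ = model(*input_signature)     # one prefill/decode step through the engine
        torch.cuda.synchronize()        # wait for the engine to complete
        times_ms.append((time.perf_counter() - t0) * 1e3)
    print(f"mean step latency: {sum(times_ms) / len(times_ms):.2f} ms")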

Here is the nsys profiling result: the red line marks approximately 1.7 ms of overhead with no GPU activity at all. (When static_cache_v2 is disabled there are no such bubbles; maybe it comes from shape copies or other operators introduced by static_cache_v2?)

[Image: nsys timeline showing an ~1.7 ms gap with no GPU activity within each prefill/decode block]

Looking forward to your reply, thanks a lot!

ChiikawaSama avatar Sep 01 '25 17:09 ChiikawaSama

@peri044 Hello, sorry for bothering you, but I believe you were the main contributor of this feature (HF compiled LLM), so you may be interested in this problem. I attempted a potential optimization but it didn't help; here's what I did. I tried to combine all of the KV caches into one big tensor, but the overhead of launching the TensorRT engine stayed the same (1.x ms). Now I suspect the ShapeEnv used by start_idx and end_idx forces the torch-tensorrt runtime to recompute shapes. Could we fix this by pinning the decode shape, since the inputs and outputs should have fixed shapes there?
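(For context, a minimal sketch of what I mean by combining the per-layer KV cache into one big tensor; the names and shapes below are hypothetical, not the actual tools/llm signatures:)

import torch

# Hypothetical layout: [num_layers, 2 (K/V), batch, num_kv_heads, max_seq_len, head_dim]
num_layers, batch, num_kv_heads, max_seq_len, head_dim = 24, 1, 2, 2048, 64
kv_cache = torch.zeros(
    num_layers, 2, batch, num_kv_heads, max_seq_len, head_dim,
    dtype=torch.float16, device="cuda",
)

# Instead of passing 2 * num_layers separate cache tensors to the engine,
# pass the single big tensor and slice per layer inside the graph:
def layer_kv(cache: torch.Tensor, layer_idx: int):
    k = cache[layer_idx, 0]  # [batch, num_kv_heads, max_seq_len, head_dim]
    v = cache[layer_idx, 1]
    return k, v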

ChiikawaSama avatar Sep 02 '25 15:09 ChiikawaSama

Hello @ChiikawaSama, thanks for sharing this. Is this profile coming from the generate_from_static_cache function (we do initialize position_ids and post-process logits there, but that doesn't explain the absence of GPU activity)? Could you share exact instructions on how you profiled this so I can repro?

peri044 avatar Sep 03 '25 20:09 peri044

Hi, I used exactly this command to run:

CUDA_VISIBLE_DEVICES=7 python3 run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v1 --benchmark

For NVTX usage:

I added the following at lines 160–161 of tools/llm/test_run_llm.sh:

with torch.cuda.nvtx.range("trt_model_run"):
    logits_keys_values = model(*input_signature)

With that you should be able to reproduce it and visualize the timeline in Nsight Systems.
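For completeness, a capture command along these lines should work (standard nsys flags, not copied from my original run; adjust the output name as needed):

nsys profile -t cuda,nvtx -o qwen_trt_static_cache python3 run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v1 --benchmark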

Thanks again for your reply.

ChiikawaSama avatar Sep 04 '25 08:09 ChiikawaSama