# Performance Issue when using tools/llm
## ❓ Question
## What you have already tried
## Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- PyTorch Version (e.g., 1.0): 2.8.0
- CPU Architecture: AMD
- OS (e.g., Linux): Ubuntu 22.04
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source): No
- Are you using local sources or building from archives: No
- Python version: 3.10
- CUDA version: 12.8
- GPU models and configuration: NVIDIA
- Any other relevant information: directly using the torch-tensorrt 2.8.0 wheel with the GitHub 2.8.0 tag to run tools/llm
## Additional context
Hi there, I tried to use tools/llm with static_cache_v2 to run a Qwen2.5 model, using the following command:
```bash
python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
```
When I profiled with Nsight Systems, I found that static_cache_v2 adds launch overhead to the TensorRT engine in every prefill/decode block. Do you see this problem too? I think this overhead is too high; it makes Torch-TensorRT almost the same speed as just enabling torch.compile.
Here is the nsys profiling result: the red line shows approximately 1.7 ms of overhead with no GPU activity at all. (When static_cache_v2 is disabled there are no such bubbles; I thought it might be shape copies or other operators introduced by static_cache_v2?)
Looking forward to your reply, thanks a lot!
@peri044 Hello, sorry to bother you, but I believe you are the main contributor of this commit (HF compiled LLM), so you might be interested in this problem. I tried a potential optimization but it failed; here's what I did. I combined all of the KV caches into one big tensor, but the overhead of launching the TensorRT engine is still the same (1.x ms). Now I suspect the ShapeEnv used by start_idx and end_idx makes the Torch-TensorRT runtime recompute shapes on every launch. Can we fix this by fixing the decode shapes, given that decode should have fixed input and output shapes?
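For reference, a minimal sketch of the "one big tensor" idea (all dimensions and names below are illustrative placeholders, not the actual tools/llm variables):

```python
import torch

# Illustrative dimensions only (roughly Qwen2.5-0.5B-sized), not the
# values actually used by tools/llm.
num_layers, num_kv_heads, head_dim = 24, 2, 64
batch, max_seq_len = 1, 2176

# One contiguous buffer instead of 2 * num_layers separate K/V tensors,
# so the engine takes a single KV-cache input instead of many.
kv_cache = torch.zeros(
    num_layers, 2, batch, num_kv_heads, max_seq_len, head_dim,
    dtype=torch.float16, device="cuda",
)

# Each layer reads/writes a view into the shared buffer (no copies).
k_cache_layer0 = kv_cache[0, 0]
v_cache_layer0 = kv_cache[0, 1]
```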
Hello @ChiikawaSama, thanks for sharing this. Is this profile coming from the generate_from_static_cache function? (We do initialization of position_ids and post-processing of logits there, but the complete absence of GPU activity wouldn't be explained by that.) Could you share the exact instructions for how you profiled this so I can repro?
Hi, I used exactly this command:
```bash
CUDA_VISIBLE_DEVICES=7 python3 run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v1 --benchmark
```
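For the nsys capture itself, an invocation along these lines should reproduce the timeline (the flags below are illustrative, not necessarily the exact ones from the original run):

```bash
# Trace CUDA kernels plus the NVTX range added below; output goes to
# qwen_static_cache.nsys-rep (open it in the Nsight Systems GUI).
CUDA_VISIBLE_DEVICES=7 nsys profile -t cuda,nvtx -o qwen_static_cache \
    python3 run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct \
    --prompt "What is parallel programming?" --precision FP16 \
    --num_tokens 128 --cache static_v1 --benchmark
```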
For NVTX usage, I added the following lines to tools/llm/test_run_llm.sh at L160~L161:
```python
with torch.cuda.nvtx.range("trt_model_run"):
    logits_keys_values = model(*input_signature)
```
With this you should be able to reproduce the issue and visualize it in Nsight Systems.
Thanks again for your reply!