
Question about configurations of runtime arguments

Open sleepwalker2017 opened this issue 1 year ago • 3 comments

I'm benchmarking vicuna 13B using TensorRT-LLM v0.9.0 on 2×A30 GPUs, trying the following configurations.

[screenshot: benchmark results for each configuration]

I think there are some strange points:

  • Enabling prefix caching alone has a positive impact on performance, while other configurations or combinations generally have negative effects.
  • When prefix caching and chunked prefill are enabled together, an error occurs during execution, which seems to be a bug.
  • The highest performance comes from prefix caching combined with preemption, yet enabling preemption alone degrades performance, which is quite strange.

sleepwalker2017 avatar Apr 23 '24 06:04 sleepwalker2017

@sleepwalker2017 Thanks for reporting the issues. Is it possible to provide more commands and steps to reproduce the issue, especially the 2nd point?

kaiyux avatar May 09 '24 03:05 kaiyux

branch: main, commit id: 66ef1df492f7bc9c8eeb01d7e14db01838e3f0bd

model=/data/vicuna-13b/vicuna-13b-v1.5/
tp=2
python convert_checkpoint.py --model_dir ${model} \
                              --output_dir ./tllm_checkpoint_2gpu_fp16 \
                              --dtype float16 --tp_size ${tp}

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp16 \
             --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu \
             --gemm_plugin float16 \
             --use_fused_mlp \
             --max_batch_size 24 \
             --max_input_len 2048 \
             --max_output_len 256 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --remove_input_padding enable \
             --workers ${tp}

mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85

You can generate input tokens using your scripts locally. @kaiyux
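For anyone trying to reproduce this without the original dataset, here is a minimal sketch of a local generator for a `token-norm-dist.json`-style file. The exact JSON schema that `gptManagerBenchmark` expects is not shown in this thread, so the field names (`samples`, `input_ids`, `output_len`) and the bounds below are assumptions for illustration; I believe recent TensorRT-LLM releases also ship a `prepare_dataset.py` helper under `benchmarks/cpp` for this purpose.

```python
# Sketch: synthesize a benchmark dataset with normally distributed
# input lengths. Schema (samples/input_ids/output_len) is an assumption.
import json
import random


def make_dataset(num_requests=24, input_mean=1024, input_stdev=128,
                 output_len=256, max_input_len=2048, vocab_size=32000,
                 seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(num_requests):
        # Clamp each sampled length to [1, max_input_len] so it fits
        # the engine built with --max_input_len 2048.
        n = max(1, min(max_input_len, int(rng.gauss(input_mean, input_stdev))))
        samples.append({
            "input_ids": [rng.randrange(vocab_size) for _ in range(n)],
            "output_len": output_len,
        })
    return {"samples": samples}


if __name__ == "__main__":
    with open("token-norm-dist.json", "w") as f:
        json.dump(make_dataset(), f)
```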

sleepwalker2017 avatar May 09 '24 06:05 sleepwalker2017

What's the flag to enable prefix caching?


geraldstanje avatar Jul 13 '24 01:07 geraldstanje
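For what it's worth, a hedged sketch of how prefix caching (KV cache block reuse) is usually switched on in this TensorRT-LLM version: the engine must be built with paged context attention, and the runtime must opt into block reuse. The `--enable_kv_cache_reuse` runtime flag is an assumption based on the v0.9-era `gptManagerBenchmark` options; check `./gptManagerBenchmark --help` on your build.

```shell
# Build time: prefix caching requires the paged KV cache plus
# paged context FMHA (already enabled in the build command above).
#   trtllm-build ... --paged_kv_cache enable --use_paged_context_fmha enable

# Run time: opt into KV cache block reuse in the benchmark binary.
# Flag name below is an assumption; verify against --help.
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark \
    --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ \
    --dataset ../../../benchmarks/cpp/token-norm-dist.json \
    --enable_kv_cache_reuse enable \
    --kv_cache_free_gpu_mem_fraction 0.85
```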