
Question about configurations of runtime arguments

Open sleepwalker2017 opened this issue 1 year ago • 3 comments

I'm benchmarking vicuna 13B using TensorRT-LLM v0.9.0 on 2×A30 GPUs, trying the following configurations.

[screenshot: benchmark results for each configuration]

I think there are some strange points:

  • Enabling prefix caching alone has a positive impact on performance, while other configurations or combinations generally have negative effects.
  • When prefix caching and chunked prefill are enabled together, an error occurs during execution, which seems to be a bug.
  • The highest performance comes from prefix caching combined with preemption, yet enabling preemption alone degrades performance, which is quite strange.

sleepwalker2017 avatar Apr 23 '24 06:04 sleepwalker2017

@sleepwalker2017 Thanks for reporting the issues. Is it possible to provide more commands and steps to reproduce the issue, especially the 2nd point?

kaiyux avatar May 09 '24 03:05 kaiyux

branch: main, commit id: 66ef1df492f7bc9c8eeb01d7e14db01838e3f0bd

model=/data/vicuna-13b/vicuna-13b-v1.5/
tp=2
python convert_checkpoint.py --model_dir ${model} \
                              --output_dir ./tllm_checkpoint_2gpu_fp16 \
                              --dtype float16 --tp_size ${tp}

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp16 \
             --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu \
             --gemm_plugin float16 \
             --use_fused_mlp \
             --max_batch_size 24 \
             --max_input_len 2048 \
             --max_output_len 256 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --remove_input_padding enable \
             --workers ${tp}

mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85

You can generate input tokens using your scripts locally. @kaiyux
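For anyone trying to reproduce this without the original dataset, here is a minimal sketch of a local generator for a `token-norm-dist.json`-style file. The exact JSON schema that `gptManagerBenchmark` expects is not shown in this thread, so the field names (`samples`, `input_ids`, `output_len`) and the bounds below are assumptions for illustration; I believe recent TensorRT-LLM releases also ship a `prepare_dataset.py` helper under `benchmarks/cpp` for this purpose.

```python
# Sketch: synthesize a benchmark dataset with normally distributed
# input lengths. Schema (samples/input_ids/output_len) is an assumption.
import json
import random


def make_dataset(num_requests=24, input_mean=1024, input_stdev=128,
                 output_len=256, max_input_len=2048, vocab_size=32000,
                 seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(num_requests):
        # Clamp each sampled length to [1, max_input_len] so it fits
        # the engine built with --max_input_len 2048.
        n = max(1, min(max_input_len, int(rng.gauss(input_mean, input_stdev))))
        samples.append({
            "input_ids": [rng.randrange(vocab_size) for _ in range(n)],
            "output_len": output_len,
        })
    return {"samples": samples}


if __name__ == "__main__":
    with open("token-norm-dist.json", "w") as f:
        json.dump(make_dataset(), f)
```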

sleepwalker2017 avatar May 09 '24 06:05 sleepwalker2017

What's the flag to enable prefix caching?


geraldstanje avatar Jul 13 '24 01:07 geraldstanje
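For what it's worth, a hedged sketch of how prefix caching (KV cache block reuse) is usually switched on in this TensorRT-LLM version: the engine must be built with paged context attention, and the runtime must opt into block reuse. The `--enable_kv_cache_reuse` runtime flag is an assumption based on the v0.9-era `gptManagerBenchmark` options; check `./gptManagerBenchmark --help` on your build.

```shell
# Build time: prefix caching requires the paged KV cache plus
# paged context FMHA (already enabled in the build command above).
#   trtllm-build ... --paged_kv_cache enable --use_paged_context_fmha enable

# Run time: opt into KV cache block reuse in the benchmark binary.
# Flag name below is an assumption; verify against --help.
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark \
    --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ \
    --dataset ../../../benchmarks/cpp/token-norm-dist.json \
    --enable_kv_cache_reuse enable \
    --kv_cache_free_gpu_mem_fraction 0.85
```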