Increasing max_new_tokens drastically slows down inference
System Info
DGX H100 setup
Hi, I am facing an issue with the generation parameter `max_new_tokens` when running inference with Llama 2. My goal with TensorRT-LLM is to run many inferences over a dataset, but I am seeing a significant slowdown when I increase `max_new_tokens`. I understand it may be somewhat slower, but I am seeing a 7-8x slowdown. That does not seem right to me; please let me know if this is expected behavior.
Additionally, I ran the run.py script with low and high `max_new_tokens` values. With that script, the high `max_new_tokens` run was only slightly slower.
If more information about the original task would help, please let me know and I'm happy to provide it. Thanks!
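For context, this is roughly how I time generation in my own loop over the dataset. `generate_fn` and `prompts` below are placeholders standing in for my actual runtime call and data, so treat it as a sketch rather than the exact code:

```python
import time

def average_latency(generate_fn, prompts, max_new_tokens):
    """Time one generate call per prompt and return the mean latency in seconds.

    generate_fn is a placeholder for whatever call actually produces tokens
    (e.g. a TensorRT-LLM runner); only the timing logic reflects my setup.
    """
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt, max_new_tokens=max_new_tokens)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

# Example: compare a low and a high cap over the same prompts.
# low = average_latency(generate_fn, prompts, max_new_tokens=100)
# high = average_latency(generate_fn, prompts, max_new_tokens=1024)
```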
Who can help?
@kaiyux
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Engine build information:
# Convert checkpoint
srun --mpi=pmi2 \
--container-image=/mnt/weka/images/llama_inference/$CONT \
--container-mounts $MOUNTS \
--container-workdir /workspace/inference/llama/ \
python convert_checkpoint.py --model_dir $MODEL_DIR \
--output_dir $CHECKPOINT_OUTPUT_DIR \
--dtype float16 \
--tp_size $tensor_parallelism \
--pp_size $pipeline_parallelism
# Build engines
srun --mpi=pmi2 \
--container-image=/mnt/weka/images/llama_inference/$CONT \
--container-mounts $MOUNTS \
--container-workdir /workspace/inference/llama/ \
trtllm-build --checkpoint_dir $CHECKPOINT_OUTPUT_DIR \
--output_dir $ENGINE_OUTPUT_DIR \
--max_batch_size $batch_size \
--max_num_tokens $max_num_tokens \
--context_fmha enable \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--remove_input_padding enable \
--paged_kv_cache enable
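As a sanity check, I also read back the config.json that trtllm-build writes into the engine output directory to confirm the build-time limits. The key names below are what I see in my version of TensorRT-LLM and may differ in others, so this is only a sketch:

```python
import json
import os

engine_dir = os.environ.get("ENGINE_OUTPUT_DIR", ".")

# trtllm-build writes a config.json next to the engine files; the exact layout
# (top-level keys vs. nested under "build_config") varies between versions.
with open(os.path.join(engine_dir, "config.json")) as f:
    cfg = json.load(f)

build_cfg = cfg.get("build_config", cfg)
for key in ("max_batch_size", "max_num_tokens", "max_input_len"):
    print(key, "=", build_cfg.get(key))
```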
run.py inputs (in run.py, `--max_output_len` is passed directly as `max_new_tokens`):
srun --mpi=pmi2 \
--container-image=/mnt/weka/images/llama_inference/$CONT \
--container-mounts $MOUNTS \
--container-workdir /workspace/inference/llama/ \
python ../run.py --run_profiling --engine_dir=$ENGINE_DIR \
--max_output_len 100 --tokenizer_dir $MODEL_DIR --input_text "How do I count to nine in French?"
vs
srun --mpi=pmi2 \
--container-image=/mnt/weka/images/llama_inference/$CONT \
--container-mounts $MOUNTS \
--container-workdir /workspace/inference/llama/ \
python ../run.py --run_profiling --engine_dir=$ENGINE_DIR \
--max_output_len 1024 --tokenizer_dir $MODEL_DIR --input_text "How do I count to nine in French?"
Results:
- output len / max_new_tokens = 100: batch_size: 1, avg latency of 10 iterations: 1.3830650568008422 sec
- output len / max_new_tokens = 1024: batch_size: 1, avg latency of 10 iterations: 1.9917121410369873 sec
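For comparison, the ratio between those two run.py measurements (numbers copied from the profiling output above) is far from the 7-8x I see in my own pipeline:

```python
latency_100 = 1.3830650568008422   # avg latency, max_output_len = 100
latency_1024 = 1.9917121410369873  # avg latency, max_output_len = 1024

print(f"run.py slowdown: {latency_1024 / latency_100:.2f}x")  # roughly 1.44x
```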
Expected behavior
When I increase max_new_tokens, model inference time increases only slightly.
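To spell out that expectation: I think of latency as a fixed prefill cost plus a per-generated-token decode cost, so if the model hits EOS well before the cap, raising max_new_tokens should change little. The constants below are made-up placeholders for illustration, not measurements:

```python
# Assumed illustrative costs, not measured values.
prefill_s = 0.05       # one-time prompt processing
per_token_s = 0.0015   # incremental decode cost per generated token, batch 1

def modeled_latency(tokens_actually_generated):
    return prefill_s + tokens_actually_generated * per_token_s

# If the answer ends at EOS after ~40 tokens, the cap barely matters:
for cap in (100, 1024):
    generated = min(40, cap)
    print(f"cap={cap}: ~{modeled_latency(generated):.3f} s")
```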
Actual behavior
When I increase max_new_tokens, model inference time increases significantly; it is roughly 7-8x slower.
Additional notes
None.
@kaiyux Hi, is there any update on this? Thanks!
Hello @noahnisbet! Is this issue still relevant? @kaiyux, any updates on this?
Issue has not received an update in over 14 days. Adding stale label.
Closing this issue as stale, but please feel free to open new one if the problem persists. Thank you!