
Increasing max_new_tokens drastically slows down inference

noahnisbet opened this issue on Mar 06 '24

System Info

DGX H100 setup

Hi, I am facing an issue with the generation parameter "max_new_tokens" when running inference with Llama 2. My goal with TensorRT-LLM is to run many inferences over a dataset, but I am seeing a significant slowdown when I increase the max_new_tokens parameter. I understand it may be somewhat slower, but I am seeing a 7-8x slowdown. That doesn't seem right to me, but please let me know if this is expected behavior.

Additionally, I ran the run.py script with low and high max_new_tokens values and found that the high value was only slightly slower with that script.

If more information about the original task would help, please let me know and I'm happy to provide it. Thanks!
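
(A minimal sketch of the kind of dataset loop described above, not the actual script used here. It assumes the ModelRunner Python runtime API that the example run.py is built on; the model name, engine path, and prompt list are placeholders.)

# Hypothetical dataset loop, for illustration only.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
runner = ModelRunner.from_dir(engine_dir="/path/to/engine_dir")        # placeholder path

prompts = ["How do I count to nine in French?"]  # placeholder dataset

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]
    # max_new_tokens is the runtime generation limit; raising it from 100
    # to 1024 is the change that produces the slowdown described above.
    outputs = runner.generate(
        [input_ids],
        max_new_tokens=1024,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))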

Who can help?

@kaiyux

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Engine build information:

# Convert checkpoint
srun --mpi=pmi2 \
    --container-image=/mnt/weka/images/llama_inference/$CONT \
    --container-mounts $MOUNTS \
    --container-workdir /workspace/inference/llama/ \
    python convert_checkpoint.py --model_dir $MODEL_DIR \
            --output_dir $CHECKPOINT_OUTPUT_DIR \
            --dtype float16 \
            --tp_size $tensor_parallelism \
            --pp_size $pipeline_parallelism

# Build engines
srun --mpi=pmi2 \
    --container-image=/mnt/weka/images/llama_inference/$CONT \
    --container-mounts $MOUNTS \
    --container-workdir /workspace/inference/llama/ \
    trtllm-build --checkpoint_dir $CHECKPOINT_OUTPUT_DIR \
            --output_dir $ENGINE_OUTPUT_DIR \
            --max_batch_size $batch_size \
            --max_num_tokens $max_num_tokens \
            --context_fmha enable \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16 \
            --remove_input_padding enable \
            --paged_kv_cache enable
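
As a sanity check of the build, the config.json that trtllm-build writes into the engine output directory records the build-time limits. A minimal sketch to read it back; the path and the key layout are assumptions and may differ across TensorRT-LLM versions.

# Sketch: read back the build parameters recorded by trtllm-build.
import json

with open("/path/to/engine_output_dir/config.json") as f:  # placeholder path
    config = json.load(f)

# Some versions nest the limits under "build_config"; fall back to the top level.
build_cfg = config.get("build_config", config)
print("max_batch_size:", build_cfg.get("max_batch_size"))
print("max_num_tokens:", build_cfg.get("max_num_tokens"))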

run.py inputs (max_new_tokens is set directly from --max_output_len in run.py):

srun --mpi=pmi2 \
    --container-image=/mnt/weka/images/llama_inference/$CONT \
    --container-mounts $MOUNTS \
    --container-workdir /workspace/inference/llama/ \
        python ../run.py --run_profiling --engine_dir=$ENGINE_DIR \
            --max_output_len 100 --tokenizer_dir $MODEL_DIR --input_text "How do I count to nine in French?"

vs

srun --mpi=pmi2 \
    --container-image=/mnt/weka/images/llama_inference/$CONT \
    --container-mounts $MOUNTS \
    --container-workdir /workspace/inference/llama/ \
        python ../run.py --run_profiling --engine_dir=$ENGINE_DIR \
            --max_output_len 1024 --tokenizer_dir $MODEL_DIR --input_text "How do I count to nine in French?"

Results:

output len / max_new_tokens = 100: batch_size: 1, avg latency of 10 iterations: 1.3830650568008422 sec

output len / max_new_tokens = 1024: batch_size: 1, avg latency of 10 iterations: 1.9917121410369873 sec
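
For comparison, the two run.py figures above amount to only about a 1.44x slowdown, far from the 7-8x observed in the dataset script. A trivial check of the arithmetic:

# Ratio of the two run.py latencies quoted above.
latency_100 = 1.3830650568008422   # sec, max_output_len = 100
latency_1024 = 1.9917121410369873  # sec, max_output_len = 1024
print(f"slowdown: {latency_1024 / latency_100:.2f}x")  # prints ~1.44x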

Expected behavior

When I increase max_new_tokens, model inference time increases only slightly.

Actual behavior

When I increase max_new_tokens in my own dataset script, model inference time increases significantly; it is 7-8x slower.

Additional notes

None.

noahnisbet avatar Mar 06 '24 16:03 noahnisbet

@kaiyux Hi, is there any update on this? Thanks!

noahnisbet avatar Mar 14 '24 16:03 noahnisbet

Hello @noahnisbet! Is this issue still relevant? @kaiyux, any updates on this?

poweiw avatar Jul 29 '25 21:07 poweiw

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Nov 05 '25 03:11 github-actions[bot]

Closing this issue as stale, but please feel free to open a new one if the problem persists. Thank you!

karljang avatar Nov 14 '25 17:11 karljang