
Batch inference results differ from single-input inference

Open · 1096125073 opened this issue 1 year ago · 9 comments

System Info

x86-64, 4x NVIDIA A10, TensorRT-LLM 0.9.0

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

A private Llama-2-type model, when given the same input repeated across a batch (e.g. batch_size=4), yields four different answers (top_k=0, top_p=0, using run.py).

Expected behavior

All four answers should be identical.

actual behavior

The batch yields four different answers.

additional notes

[Screenshot: the four outputs differ from each other]

1096125073 avatar Jul 03 '24 02:07 1096125073

I have disabled custom_all_reduce when building the engine.

1096125073 avatar Jul 03 '24 02:07 1096125073

Hi @1096125073, different batch sizes may lead to different kernels being selected, so the results can differ. This is a known issue.

QiJune avatar Jul 04 '24 08:07 QiJune

Hi @1096125073, different batch sizes may lead to different kernels being selected, so the results can differ. This is a known issue.

Thank you for your answer! Sorry, I may not have expressed myself clearly. When I run inference with batch size 4 and the same input repeated across the batch, the four outputs I get are different from each other.

1096125073 avatar Jul 04 '24 08:07 1096125073

@1096125073 Yes, I get your point: you repeat the same input prompt 4 times to form a batch, but the outputs differ from the batch size 1 result. Unfortunately, it's a known issue.

BTW, do you observe a similar phenomenon in PyTorch?

QiJune avatar Jul 04 '24 09:07 QiJune
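
For reference, a minimal sketch of how one could check the same thing in PyTorch with Hugging Face transformers (greedy decoding on a batch of four identical prompts); the model path reuses the one mentioned later in this thread and is only a placeholder:

# Sketch (assumption): batch of identical prompts, greedy decoding; the rows should match if batching is deterministic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/llm-models/llama-models-v2/llama-v2-7b-hf"  # placeholder path
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, device_map="cuda")

prompts = ["How are you"] * 4                          # same prompt repeated, so no padding is needed
inputs = tok(prompts, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)   # greedy decoding
texts = tok.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(texts)
print("all rows identical:", len(set(texts)) == 1)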

@1096125073 Yes, I get your point: you repeat the same input prompt 4 times to form a batch, but the outputs differ from the batch size 1 result. Unfortunately, it's a known issue.

BTW, do you observe a similar phenomenon in PyTorch?

Sorry, I meant that these four outputs are different from each other, as shown in the picture above.

1096125073 avatar Jul 04 '24 09:07 1096125073

Hi @1096125073 , I tried the llama2 model:

python convert_checkpoint.py --model_dir=/llm-models/llama-models-v2/llama-v2-7b-hf/ --output_dir=./ckpt --dtype bfloat16

trtllm-build --checkpoint_dir=./ckpt --output_dir=./engine --gemm_plugin bfloat16 --max_output_len=256 --max_batch_size=4

python ../run.py --engine_dir=./engine --max_output_len=10 --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/  --input_text 'How are you' 'How are you' 'How are you' 'How are you'

Here is the result:

Input [Text 0]: "<s> How are you"
Output [Text 0 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 1]: "<s> How are you"
Output [Text 1 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 2]: "<s> How are you"
Output [Text 2 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 3]: "<s> How are you"
Output [Text 3 Beam 0]: "doing? I hope you are doing well. I"

QiJune avatar Jul 05 '24 08:07 QiJune

@1096125073 Could you please try the main branch? It seems you are using version 0.9.0.

QiJune avatar Jul 05 '24 08:07 QiJune

@1096125073 Are you using multiple GPUs? If so, you can set NCCL_ALGO=Tree to enforce a stable reduction order. NCCL usually selects the Ring algorithm, whose reduction order is not deterministic, which can cause different results within the same batch. If you are using a single GPU, then it is likely a different issue.

yuxianq avatar Jul 05 '24 12:07 yuxianq
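
For the multi-GPU case, the variable can be exported on the launch command. A sketch, assuming a 4-way tensor-parallel engine, the example run.py from this repository, and an OpenMPI launcher (-x exports the variable to all ranks):

mpirun -n 4 -x NCCL_ALGO=Tree python ../run.py --engine_dir=./engine --max_output_len=10 --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/ --input_text 'How are you' 'How are you' 'How are you' 'How are you'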

@QiJune I encountered the same issue with the T5 model (float16). The inference results vary slightly with different batch sizes during extensive sample testing. Is this a normal phenomenon? I saw a similar issue reported here: https://github.com/dmlc/gluon-nlp/issues/1344.

0xd8b avatar Aug 15 '24 09:08 0xd8b

@1096125073 Are you using multiple GPUs? If so, you can set NCCL_ALGO=Tree to enforce a stable reduction order. NCCL usually selects the Ring algorithm, whose reduction order is not deterministic, which can cause different results within the same batch. If you are using a single GPU, then it is likely a different issue.

Yes, this is the answer I wanted, thanks!

1096125073 avatar Aug 23 '24 10:08 1096125073

@QiJune I'm experiencing this issue even when using a single GPU. If the discrepancies in results are due to varying kernel choices, is there a way to sacrifice some performance in exchange for more stable results?

chiendb97 avatar Aug 23 '24 10:08 chiendb97

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Dec 04 '24 17:12 github-actions[bot]

This issue was closed because it has been 14 days without activity since it has been marked as stale.

github-actions[bot] avatar Dec 18 '24 18:12 github-actions[bot]

+1 here

DZADSL72-00558 avatar Feb 26 '25 21:02 DZADSL72-00558