
Batch inference results differ from single-input inference

Open · 1096125073 opened this issue 1 year ago · 9 comments

System Info

x86-64, 4x NVIDIA A10, TensorRT-LLM 0.9.0

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

A private Llama-2-type model, when given the same input repeated across a batch (e.g. batch_size=4), yields four different answers (top_k=0, top_p=0, using run.py).

Expected behavior

All four answers should be identical.

actual behavior

The batch yields four different answers.

additional notes

[Screenshot: the four outputs differ from each other]

1096125073 avatar Jul 03 '24 02:07 1096125073

I have disabled custom_all_reduce when building the engine.

1096125073 avatar Jul 03 '24 02:07 1096125073

Hi @1096125073, different batch sizes may lead to different kernels being selected, so the results can differ. This is a known issue.

QiJune avatar Jul 04 '24 08:07 QiJune

Hi @1096125073, different batch sizes may lead to different kernels being selected, so the results can differ. This is a known issue.

Thank you for your answer! Sorry, I may not have expressed myself clearly. When I run inference with batch size 4 and the same input repeated across the batch, the four outputs I get are different from each other.

1096125073 avatar Jul 04 '24 08:07 1096125073

@1096125073 Yes, I get your point: you repeat the same input prompt 4 times to form a batch, but the outputs differ from the batch size 1 result. Unfortunately, it's a known issue.

BTW, do you observe a similar phenomenon in PyTorch?

QiJune avatar Jul 04 '24 09:07 QiJune
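
For reference, a minimal sketch of how one could check the same thing in PyTorch with Hugging Face transformers (greedy decoding on a batch of four identical prompts); the model path reuses the one mentioned later in this thread and is only a placeholder:

# Sketch (assumption): batch of identical prompts, greedy decoding; the rows should match if batching is deterministic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/llm-models/llama-models-v2/llama-v2-7b-hf"  # placeholder path
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, device_map="cuda")

prompts = ["How are you"] * 4                          # same prompt repeated, so no padding is needed
inputs = tok(prompts, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)   # greedy decoding
texts = tok.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(texts)
print("all rows identical:", len(set(texts)) == 1)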

@1096125073 Yes, I get your point: you repeat the same input prompt 4 times to form a batch, but the outputs differ from the batch size 1 result. Unfortunately, it's a known issue.

BTW, do you observe a similar phenomenon in PyTorch?

Sorry, I meant that these four outputs are different from each other, as shown in the picture above.

1096125073 avatar Jul 04 '24 09:07 1096125073

Hi @1096125073 , I tried the llama2 model:

python convert_checkpoint.py --model_dir=/llm-models/llama-models-v2/llama-v2-7b-hf/ --output_dir=./ckpt --dtype bfloat16

trtllm-build --checkpoint_dir=./ckpt --output_dir=./engine --gemm_plugin bfloat16 --max_output_len=256 --max_batch_size=4

python ../run.py --engine_dir=./engine --max_output_len=10 --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/  --input_text 'How are you' 'How are you' 'How are you' 'How are you'

Here is the result:

Input [Text 0]: "<s> How are you"
Output [Text 0 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 1]: "<s> How are you"
Output [Text 1 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 2]: "<s> How are you"
Output [Text 2 Beam 0]: "doing? I hope you are doing well. I"
Input [Text 3]: "<s> How are you"
Output [Text 3 Beam 0]: "doing? I hope you are doing well. I"

QiJune avatar Jul 05 '24 08:07 QiJune

@1096125073 Could you please try the main branch? It seems you are using version 0.9.0.

QiJune avatar Jul 05 '24 08:07 QiJune

@1096125073 Are you using multiple GPUs? If so, you can set NCCL_ALGO=Tree to enforce a stable reduction order. NCCL usually selects the Ring algorithm, whose reduction order is not deterministic, which can cause different results within the same batch. If you are using a single GPU, then it is likely a different issue.

yuxianq avatar Jul 05 '24 12:07 yuxianq
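
For the multi-GPU case, the variable can be exported on the launch command. A sketch, assuming a 4-way tensor-parallel engine, the example run.py from this repository, and an OpenMPI launcher (-x exports the variable to all ranks):

mpirun -n 4 -x NCCL_ALGO=Tree python ../run.py --engine_dir=./engine --max_output_len=10 --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/ --input_text 'How are you' 'How are you' 'How are you' 'How are you'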

@QiJune I encountered the same issue with the T5 model (float16). The inference results vary slightly with different batch sizes during extensive sample testing. Is this a normal phenomenon? I saw a similar issue reported here: https://github.com/dmlc/gluon-nlp/issues/1344.

0xd8b avatar Aug 15 '24 09:08 0xd8b

@1096125073 Are you using multiple GPUs? If so, you can set NCCL_ALGO=Tree to enforce a stable reduction order. NCCL usually selects the Ring algorithm, whose reduction order is not deterministic, which can cause different results within the same batch. If you are using a single GPU, then it is likely a different issue.

Yes, this is the answer I wanted, thanks!

1096125073 avatar Aug 23 '24 10:08 1096125073

@QiJune I'm experiencing this issue even when using a single GPU. If the discrepancies in results are due to varying kernel choices, is there a way to sacrifice some performance in exchange for more stable results?

chiendb97 avatar Aug 23 '24 10:08 chiendb97

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Dec 04 '24 17:12 github-actions[bot]

This issue was closed because it has been 14 days without activity since it has been marked as stale.

github-actions[bot] avatar Dec 18 '24 18:12 github-actions[bot]

+1 here

DZADSL72-00558 avatar Feb 26 '25 21:02 DZADSL72-00558