TensorRT-LLM
Results differ between 0.9.0 and 0.10.0, and speed has decreased after updating
System Info
- CPU: x86
- GPU: A100
- OS: Red Hat
- Driver: 535.154.05
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I use the same models, vicuna-7b-v1.3 and medusa-vicuna-7b-v1.3. With version 0.9.0 and the nvidia/cuda:12.1.0 image, the input 'Once upon' gives a response and output-token speed like:
But after updating to version 0.10.0 and using the 12.4.0 image, the response changed and the speed decreased, like:
I also ran the same model with vLLM and got the same response as version 0.9.0. Why has the result changed and the speed decreased after updating? Thanks~
I noticed one difference between the two versions: the default temperature. 0.9.0 uses temperature=0.0, while 0.10.0 uses temperature=1.0.
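That temperature difference alone can explain the changed outputs. A minimal sketch (plain Python, not TensorRT-LLM code) of why: with temperature=0.0 decoding is greedy (deterministic argmax over logits), while with temperature=1.0 tokens are sampled from the softmax distribution, so generations can differ from the greedy output.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token id from logits; temperature=0.0 means greedy argmax."""
    if temperature == 0.0:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax, computed stably by subtracting the max.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample from the categorical distribution via inverse CDF.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5]  # toy logits for a 3-token vocabulary
rng = random.Random(0)
greedy = [sample_token(logits, 0.0, rng) for _ in range(5)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(5)]
print(greedy)   # always the top token
print(sampled)  # a mixture of tokens, so output text diverges
```

So if you want 0.10.0 to reproduce the 0.9.0 outputs, explicitly setting the sampling temperature to 0.0 (rather than relying on the changed default) should restore greedy decoding.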
Expected behavior
After updating, speed should improve or remain consistent with the old version, and the model output should not change.
actual behavior
After updating, the result is different and the speed has slowed down.
additional notes
See Reproduction above.
I see the same issue with Llama-3 70B: the v0.10.0 engine runs 0.5-1.5 seconds slower than the same engine under v0.9.0.
@sundayKK, please try to use the latest version of TrtLLM.