TensorRT-LLM
Results differ between 0.9.0 and 0.10.0, and speed has decreased after updating
System Info
- CPU: x86
- GPU: A100
- OS: Red Hat
- Driver: 535.154.05
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I use the same models, vicuna-7b-v1.3 and medusa-vicuna-7b-v1.3. With version 0.9.0 and the nvidia/cuda:12.1.0 image, the input 'Once upon' gives a response and output-token speed like:
But after updating to version 0.10.0 and using the 12.4.0 image, the response changed and the speed decreased, like:
I also ran the same model with vLLM and got the same response as version 0.9.0. Why has the result changed and the speed decreased after updating? Thanks~
I noticed one difference between the two versions: the default temperature. 0.9.0 uses temperature=0.0, while 0.10.0 uses temperature=1.0.
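That temperature difference alone can explain the changed outputs. A minimal sketch (plain Python, not TensorRT-LLM code) of why: with temperature=0.0 decoding is greedy (deterministic argmax over logits), while with temperature=1.0 tokens are sampled from the softmax distribution, so generations can differ from the greedy output.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token id from logits; temperature=0.0 means greedy argmax."""
    if temperature == 0.0:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax, computed stably by subtracting the max.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample from the categorical distribution via inverse CDF.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5]  # toy logits for a 3-token vocabulary
rng = random.Random(0)
greedy = [sample_token(logits, 0.0, rng) for _ in range(5)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(5)]
print(greedy)   # always the top token
print(sampled)  # a mixture of tokens, so output text diverges
```

So if you want 0.10.0 to reproduce the 0.9.0 outputs, explicitly setting the sampling temperature to 0.0 (rather than relying on the changed default) should restore greedy decoding.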
Expected behavior
After updating, speed should improve or remain consistent with the old version, and the model output should not change.
actual behavior
After updating, the result is different and the speed has slowed down.
additional notes
See Reproduction above.
I see the same issue with Llama-3 70B: the v0.10.0 engine runs 0.5-1.5 seconds slower than the same engine under v0.9.0.
@sundayKK, please try to use the latest version of TrtLLM.