tensorrt_llm_bls disregards top_k / temperature setting
System Info
Triton + TRT-LLM 0.9.0, Llama 2 70B model, FP8 quantization, running on 2x H100 80GB with tp 2, pp 1.

config.pbtxt for tensorrt_llm_bls (otherwise unchanged):
```
parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}
```
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```shell
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'
```
Output:
```json
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"text_output":"- Quora\nMachine learning is a field of artificial intelligence which enables machines to learn without being specifically"}
```
Expected behavior
Given temperature 100.0 and top_k 100, one would expect a nonsensical answer, not the canonical one.
Actual behavior
See the Reproduction section above: despite the extreme sampling settings, the model returns the canonical answer.
Additional notes
The ensemble model works as expected. I sent the following request to the same running engine just a few seconds after the tensorrt_llm_bls request above:
```shell
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'
```
Output:
```json
{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Crussischroo-Nau ™ Technocrord! Evaluuj poddano"}
```
Hi, did you solve this problem? I ran into the same issue with both TRT-LLM 0.9.0 and TRT-LLM 0.7.0.
@Elissa0723
Hi, sorry for the late reply. Not yet. We have not needed tensorrt_llm_bls (streaming) for some time; I expect to give it another try once TRT-LLM 0.12.0 is released. There could also be a similar issue when inferring over gRPC, but I need to investigate that first; a quick check could look like the sketch below.
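A minimal gRPC sketch using tritonclient, for reference. The input names mirror the JSON keys of the generate request above; the [1, 1] shapes, the dtypes, and the port 8001 endpoint are assumptions based on a typical tensorrt_llm_bls config and default Triton setup:

```python
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def make_input(name, arr):
    # Wrap a numpy array as a Triton InferInput.
    inp = grpcclient.InferInput(name, arr.shape, np_to_triton_dtype(arr.dtype))
    inp.set_data_from_numpy(arr)
    return inp

client = grpcclient.InferenceServerClient(url="localhost:8001")
inputs = [
    # [1, 1] shapes assume batch size 1; dtypes follow the usual
    # tensorrt_llm_bls config (BYTES / INT32 / FP32).
    make_input("text_input", np.array([["What is machine learning?"]], dtype=object)),
    make_input("max_tokens", np.array([[20]], dtype=np.int32)),
    make_input("temperature", np.array([[100.0]], dtype=np.float32)),
    make_input("top_k", np.array([[100]], dtype=np.int32)),
]
result = client.infer(model_name="tensorrt_llm_bls", inputs=inputs)
print(result.as_numpy("text_output"))
```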
Hi @janpetrov @Elissa0723, the "disregarding" of temperature should be fixed by this PR: https://github.com/triton-inference-server/tensorrtllm_backend/pull/578. You should see that temperature is now correctly passed when using the BLS model: https://github.com/triton-inference-server/tensorrtllm_backend/blob/edf17484f98e64d0ec1d267323d3a478d72decdb/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py#L401, whereas it was not being passed before this PR.
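One quick way to confirm the fix after upgrading is to send the same prompt at two temperatures and check that the outputs diverge; a minimal sketch, again assuming the HTTP generate endpoint from the reproduction above:

```python
import requests

URL = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"

def generate(temperature: float) -> str:
    payload = {
        "text_input": "What is machine learning?",
        "max_tokens": 20,
        "bad_words": "",
        "stop_words": "",
        "temperature": temperature,
        "top_k": 100,
    }
    return requests.post(URL, json=payload, timeout=60).json()["text_output"]

# With the fix applied, the high-temperature output should be visibly
# degenerate, as it already was for the ensemble model.
print(generate(1.0))
print(generate(100.0))
```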
Thank you for flagging and fixing this; it works now.