tensorrt_llm_bls disregards top_k / temperature setting
System Info
Triton + TRT-LLM 0.9.0, Llama 2 70B model, FP8 quantization, running on 2x H100 80GB with tp 2, pp 1.

config.pbtxt for tensorrt_llm_bls (otherwise unchanged):
```
parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}
```
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```shell
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'
```
Output:
```json
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"text_output":"- Quora\nMachine learning is a field of artificial intelligence which enables machines to learn without being specifically"}
```
Expected behavior
Given temperature 100.0 and top_k 100, one would expect a nonsensical answer, not the canonical one.
Actual behavior
See the Reproduction section above: despite the extreme sampling settings, the model returns the canonical answer.
Additional notes
The ensemble model works as expected. I sent the following request to the same running engine just a few seconds after the tensorrt_llm_bls request above:
```shell
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'
```
Output:
```json
{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Crussischroo-Nau ™ Technocrord! Evaluuj poddano"}
```
Hi, did you solve this problem? I ran into the same issue with both TRT-LLM 0.9.0 and TRT-LLM 0.7.0.
@Elissa0723
Hi, sorry for the late reply. Not yet. We have not needed tensorrt_llm_bls (streaming) for some time; I expect to give it another try once TRT-LLM 0.12.0 is released. There could also be a similar issue when inferring over gRPC, but I need to investigate that first; a quick check could look like the sketch below.
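A minimal gRPC sketch using tritonclient, for reference. The input names mirror the JSON keys of the generate request above; the [1, 1] shapes, the dtypes, and the port 8001 endpoint are assumptions based on a typical tensorrt_llm_bls config and default Triton setup:

```python
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def make_input(name, arr):
    # Wrap a numpy array as a Triton InferInput.
    inp = grpcclient.InferInput(name, arr.shape, np_to_triton_dtype(arr.dtype))
    inp.set_data_from_numpy(arr)
    return inp

client = grpcclient.InferenceServerClient(url="localhost:8001")
inputs = [
    # [1, 1] shapes assume batch size 1; dtypes follow the usual
    # tensorrt_llm_bls config (BYTES / INT32 / FP32).
    make_input("text_input", np.array([["What is machine learning?"]], dtype=object)),
    make_input("max_tokens", np.array([[20]], dtype=np.int32)),
    make_input("temperature", np.array([[100.0]], dtype=np.float32)),
    make_input("top_k", np.array([[100]], dtype=np.int32)),
]
result = client.infer(model_name="tensorrt_llm_bls", inputs=inputs)
print(result.as_numpy("text_output"))
```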
Hi @janpetrov @Elissa0723, the "disregarding" of temperature should be fixed by this PR: https://github.com/triton-inference-server/tensorrtllm_backend/pull/578. You should see that temperature is now correctly passed when using the BLS model: https://github.com/triton-inference-server/tensorrtllm_backend/blob/edf17484f98e64d0ec1d267323d3a478d72decdb/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py#L401, whereas it was not being passed before this PR.
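One quick way to confirm the fix after upgrading is to send the same prompt at two temperatures and check that the outputs diverge; a minimal sketch, again assuming the HTTP generate endpoint from the reproduction above:

```python
import requests

URL = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"

def generate(temperature: float) -> str:
    payload = {
        "text_input": "What is machine learning?",
        "max_tokens": 20,
        "bad_words": "",
        "stop_words": "",
        "temperature": temperature,
        "top_k": 100,
    }
    return requests.post(URL, json=payload, timeout=60).json()["text_output"]

# With the fix applied, the high-temperature output should be visibly
# degenerate, as it already was for the ensemble model.
print(generate(1.0))
print(generate(100.0))
```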
Thank you for flagging and fixing this; it works now.