
tensorrt_llm_bls disregards top_k / temperature setting

Open · janpetrov opened this issue 1 year ago

System Info

Triton + TRT-LLM 0.9.0, Llama 2 70B model, FP8 quantization, run on 2x H100 80GB with TP 2, PP 1.

config.pbtxt for tensorrt_llm_bls (otherwise unchanged):

parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'

output:

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"text_output":"- Quora\nMachine learning is a field of artificial intelligence which enables machines to learn without being specifically"}

Expected behavior

Given temperature 100.0 and top_k 100, one would expect a nonsensical (and not the canonical) answer.
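For intuition, here is a minimal, self-contained sketch (plain NumPy, not the TRT-LLM sampler) of top-k sampling with temperature. With temperature 100.0 the scaled logits are nearly flat, so all 100 candidate tokens become roughly equally likely and the continuation should read as noise:

import numpy as np

def sample_token(logits, temperature, top_k, rng):
    # Temperature scaling: a very large temperature flattens the distribution.
    scaled = logits / temperature
    # Keep only the top_k highest-scoring tokens, then sample among them.
    top_idx = np.argsort(scaled)[-top_k:]
    probs = np.exp(scaled[top_idx] - np.max(scaled[top_idx]))
    probs /= probs.sum()
    return rng.choice(top_idx, p=probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=32000)  # stand-in for vocabulary-sized logits
print(sample_token(logits, temperature=1.0, top_k=100, rng=rng))
print(sample_token(logits, temperature=100.0, top_k=100, rng=rng))

That is why a garbled continuation (like the ensemble output further below) is the expected result here, not the canonical answer.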

actual behavior

See the output in the Reproduction section above: despite the extreme sampling settings, the BLS model returns the usual coherent answer, as if temperature and top_k were ignored.

additional notes

The ensemble model works as expected. I sent the following request to the same running engine just a few seconds after the tensorrt_llm_bls request above:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'

output:

{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Crussischroo-Nau ™ Technocrord! Evaluuj poddano"}

janpetrov avatar May 23 '24 14:05 janpetrov

Hi, did you solve this problem? I ran into the same issue with both TRT-LLM 0.9.0 and TRT-LLM 0.7.0.

Elissa0723 avatar Aug 06 '24 08:08 Elissa0723

@Elissa0723 Hi, I am sorry for the late reply. Not yet. We have not needed tensorrt_llm_bls (streaming) for some time, and I expect to give it another try after TRT-LLM 0.12.0 is released. Besides, there could be a similar issue when inferring over gRPC, but I still need to investigate that.
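For reference, here is a minimal sketch of how I intend to test the same request over gRPC with the tritonclient Python package. The input names and values are taken from the HTTP request above; the [1, 1] shapes are an assumption and may need adjusting to your model config:

import numpy as np
import tritonclient.grpc as grpcclient

def make_input(name, array):
    # Map the numpy dtype to the Triton datatype string.
    if array.dtype == object:
        dtype = "BYTES"
    elif array.dtype == np.int32:
        dtype = "INT32"
    else:
        dtype = "FP32"
    tensor = grpcclient.InferInput(name, list(array.shape), dtype)
    tensor.set_data_from_numpy(array)
    return tensor

client = grpcclient.InferenceServerClient("localhost:8001")
inputs = [
    make_input("text_input", np.array([["What is machine learning?"]], dtype=object)),
    make_input("max_tokens", np.array([[20]], dtype=np.int32)),
    make_input("bad_words", np.array([[""]], dtype=object)),
    make_input("stop_words", np.array([[""]], dtype=object)),
    make_input("temperature", np.array([[100.0]], dtype=np.float32)),
    make_input("top_k", np.array([[100]], dtype=np.int32)),
]
result = client.infer("tensorrt_llm_bls", inputs)
print(result.as_numpy("text_output"))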

janpetrov avatar Aug 30 '24 10:08 janpetrov

Hi @janpetrov @Elissa0723 , the "disregarding" of temperature should be fixed with this PR: https://github.com/triton-inference-server/tensorrtllm_backend/pull/578. You should see that temperature is now correctly passed when using the BLS model: https://github.com/triton-inference-server/tensorrtllm_backend/blob/edf17484f98e64d0ec1d267323d3a478d72decdb/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py#L401, whereas it wasn't being passed before this PR.
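For anyone landing here, a rough way to sanity-check that the fix is active in a deployment is to hit the same BLS generate endpoint as in the reproduction with a normal and an extreme temperature and compare the outputs (endpoint and field names are copied from the issue; the comparison is a heuristic, not part of the backend):

import requests

URL = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"

def generate(temperature):
    # Same request body as the curl reproduction above, with a variable temperature.
    payload = {
        "text_input": "What is machine learning?",
        "max_tokens": 20,
        "bad_words": "",
        "stop_words": "",
        "temperature": temperature,
        "top_k": 100,
    }
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    return response.json()["text_output"]

# After the fix, the temperature=100.0 output should look like gibberish
# (as in the ensemble example), not the canonical answer.
print("temperature=1.0   ->", generate(1.0))
print("temperature=100.0 ->", generate(100.0))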

rmccorm4 avatar Oct 09 '24 19:10 rmccorm4

Thank you for flagging and fixing this; it works now.

janpetrov avatar Oct 09 '24 19:10 janpetrov