[Bug] Zero temperature curl request affects non-zero temperature requests
System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

Versions:
- https://github.com/NVIDIA/TensorRT-LLM.git (bf0a5af)
- https://github.com/triton-inference-server/tensorrtllm_backend.git (ae52bce3ed8ecea468a16483e0dacd3d156ae4fe)

Model: zephyr-7b-beta
Who can help?
@kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
step 1:

```
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir zephyr-7b-beta \
    --output_dir zephyr-7b-beta-converted \
    --dtype float16
```
step 2:

```
trtllm-build --checkpoint_dir zephyr-7b-beta-converted \
    --output_dir zephyr-7b-beta-trt-engine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 16384 \
    --strongly_typed
```
step 3 (tensorrtllm_backend parameters):

```
MODEL_PATH=zephyr-7b-beta
MODEL_PIPELINE_NAME=triton_model_repo
MAX_BATCH_SIZE=32
ENGINE_PATH=zephyr-7b-beta-trt-engine
MAX_ATTENTION_WINDOW_SIZE=4096
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
batch_scheduler_policy=guaranteed_no_evict

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:zephyr-7b-beta/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt
```
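Before moving on to step 4, server readiness can be sanity-checked against the HTTP port given above (`--http_port=8081`) via Triton's standard health endpoint:

```
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8081/v2/health/ready
```

A 200 here means the server and loaded models are ready to serve requests.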
step 4:

A correct curl test (run it in a loop so that a bad request can be sent in the middle of the run):

```
curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.7}'
```

At the same time, send a bad curl request with zero temperature:

```
curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.0}'
```
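A minimal script for reproducing the overlap, assuming bash and the same endpoint as the curl commands above (the request count and the sleep are arbitrary choices):

```
#!/usr/bin/env bash
URL=http://127.0.0.1:8888/v2/models/ensemble/generate
# Same payload as above; %s is filled in with the temperature.
BODY='{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":%s}'

# Good requests (temperature 0.7) in a background loop.
for i in $(seq 1 50); do
  curl -s -o /dev/null -w "good: %{http_code}\n" -X POST "$URL" \
       -d "$(printf "$BODY" 0.7)"
done &

sleep 1  # let some good requests complete first

# One bad request (temperature 0.0) fired mid-run.
curl -s -o /dev/null -w "bad: %{http_code}\n" -X POST "$URL" \
     -d "$(printf "$BODY" 0.0)"
wait
```

With the bug present, good requests in flight alongside the bad one also start returning 400.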
Expected behavior
The good curl request should get a response:

```
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n machinery learning is a subset of artificial intelligence that focuses on enabling computer systems to automatically learn and improve"}
```

while the bad one should return a 400 error.
Actual behavior

Both the good and the bad requests get a 400 error:

```
400 Client Error: Bad Request for url: http://127.0.0.1:8888/v2/models/ensemble/generate
resp: {"error":"in ensemble 'ensemble', Encountered error for requestId 1627051478: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: temperature penalty param (0.000000) is out of limits (0.000000, 340282346638528859811704183484516925440.000000] (/app/tensorrt_llm/cpp/tensorrt_llm/layers/fillBuffers.h:64)\n1 0x7f3978267f71 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102\n2 0x7f38af8f0624 void tensorrt_llm::layers::FillBuffers::operator()
```
Additional notes
None
Same issue. @kaiyux

Same issue here. Could you help look into this? Thanks @kaiyux

@kaiyux any progress on this issue? Thanks
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.