[Bug] Zero temperature curl request affects non-zero temperature requests
System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

Versions:
- https://github.com/NVIDIA/TensorRT-LLM.git (bf0a5af)
- https://github.com/triton-inference-server/tensorrtllm_backend.git (ae52bce3ed8ecea468a16483e0dacd3d156ae4fe)

Model: zephyr-7b-beta
Who can help?
@kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
step 1:

```
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir zephyr-7b-beta \
    --output_dir zephyr-7b-beta-converted \
    --dtype float16
```
step 2:

```
trtllm-build --checkpoint_dir zephyr-7b-beta-converted \
    --output_dir zephyr-7b-beta-trt-engine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 16384 \
    --strongly_typed
```
step 3 (tensorrtllm_backend parameters):

```
MODEL_PATH=zephyr-7b-beta
MODEL_PIPELINE_NAME=triton_model_repo
MAX_BATCH_SIZE=32
ENGINE_PATH=zephyr-7b-beta-trt-engine
MAX_ATTENTION_WINDOW_SIZE=4096
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
batch_scheduler_policy=guaranteed_no_evict

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:zephyr-7b-beta/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt
```
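Before moving on to step 4, server readiness can be sanity-checked against the HTTP port given above (`--http_port=8081`) via Triton's standard health endpoint:

```
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8081/v2/health/ready
```

A 200 here means the server and loaded models are ready to serve requests.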
step 4:

A correct curl test (run it in a loop so that a bad request can be sent in the middle of the run):

```
curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.7}'
```

At the same time, send a bad curl request with zero temperature:

```
curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.0}'
```
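A minimal script for reproducing the overlap, assuming bash and the same endpoint as the curl commands above (the request count and the sleep are arbitrary choices):

```
#!/usr/bin/env bash
URL=http://127.0.0.1:8888/v2/models/ensemble/generate
# Same payload as above; %s is filled in with the temperature.
BODY='{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":%s}'

# Good requests (temperature 0.7) in a background loop.
for i in $(seq 1 50); do
  curl -s -o /dev/null -w "good: %{http_code}\n" -X POST "$URL" \
       -d "$(printf "$BODY" 0.7)"
done &

sleep 1  # let some good requests complete first

# One bad request (temperature 0.0) fired mid-run.
curl -s -o /dev/null -w "bad: %{http_code}\n" -X POST "$URL" \
     -d "$(printf "$BODY" 0.0)"
wait
```

With the bug present, good requests in flight alongside the bad one also start returning 400.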
Expected behavior
The good curl request should get a response:

```
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n machinery learning is a subset of artificial intelligence that focuses on enabling computer systems to automatically learn and improve"}
```

while the bad one should return a 400 error.
Actual behavior

Both the good and the bad requests get a 400 error:

```
400 Client Error: Bad Request for url: http://127.0.0.1:8888/v2/models/ensemble/generate
resp: {"error":"in ensemble 'ensemble', Encountered error for requestId 1627051478: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: temperature penalty param (0.000000) is out of limits (0.000000, 340282346638528859811704183484516925440.000000] (/app/tensorrt_llm/cpp/tensorrt_llm/layers/fillBuffers.h:64)\n1 0x7f3978267f71 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102\n2 0x7f38af8f0624 void tensorrt_llm::layers::FillBuffers::operator()
```
Additional notes
None
Same issue. @kaiyux

Same issue here. Could you help look into this? Thanks @kaiyux

@kaiyux any progress on this issue? Thanks
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.