
decoding_mode top_k_top_p does not take effect for llama2; behavior is not the same as HuggingFace

yjjiang11 opened this issue 9 months ago • 3 comments

System Info

  • CPU Architecture: AMD EPYC 7V13 64-Core Processor
  • CPU/Host memory size: 440
  • GPU name: NVIDIA A800 80GB x2
  • GPU memory size: 80 GB x 2
  • TensorRT-LLM branch or tag: main
  • TensorRT-LLM commit: https://github.com/triton-inference-server/tensorrtllm_backend/commit/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe
  • Versions of TensorRT, CUDA: 10.0.1, 12.4
  • Container used: built from the tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
  • NVIDIA driver version: 535.161.07
  • OS: Ubuntu 22.04.4 LTS
  • Docker image version: custom built from the main branch

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Build the TRT-LLM backend container by running DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend . (note the trailing . for the build context)

  2. Launch the container with this command: sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash

  3. Build the TRT engines following the guide at https://github.com/NVIDIA/TensorRT-LLM/tree/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/examples/llama (a build sketch follows this list)

  4. Launch the Triton server with decoding_mode set to top_k_top_p (how the mode was applied is sketched under the additional notes below):

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/path/to/llama2-70b/repo --log

  5. Query the server twice with the same request; the prompt "背一首诗" means "recite a poem" (a scripted version follows below):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'
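For step 3, a minimal build sketch, assuming the two-stage convert-then-build flow described in the linked examples/llama README at that commit; all paths are placeholders, and --tp_size 4 mirrors the --world_size=4 used at launch:

    # Hedged sketch of step 3: convert the HF checkpoint, then build engines.
    # Paths are placeholders; flags follow the examples/llama README.
    python3 examples/llama/convert_checkpoint.py \
        --model_dir /path/to/llama2-70b-hf \
        --output_dir /tmp/llama2-70b/trt_ckpt \
        --dtype float16 \
        --tp_size 4

    # Build TP=4 engines and place them where the model repo expects them
    # (the tensorrt_llm/1 directory is an assumption about the repo layout).
    trtllm-build \
        --checkpoint_dir /tmp/llama2-70b/trt_ckpt \
        --output_dir /path/to/llama2-70b/repo/tensorrt_llm/1 \
        --gemm_plugin float16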
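For step 5, a scripted version of the double query. One variable worth controlling: if the ensemble's optional random_seed input is left unset, the runtime may reuse the same seed across requests, which would produce identical outputs even with top_k_top_p sampling enabled. The random_seed field name is taken from the ensemble config in this repo and should be double-checked:

    # Hedged sketch: send the same prompt twice with different seeds.
    # With sampling active, the two completions should usually differ.
    for seed in 1 2; do
      curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
        "text_input": "背一首诗",
        "max_tokens": 20,
        "bad_words": "",
        "stop_words": "</s>",
        "top_k": 100,
        "top_p": 1,
        "random_seed": '"$seed"'
      }'
      echo
    done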

Expected behavior

  • response1:

{ "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }

  • response2: { "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《赋得古原草送别》白居易离离原上草,一岁一枯荣。野火烧不尽,春风吹又生。远芳侵古道,晴翠接荒城。又送王孙去,萋萋满别情。。" }

Actual behavior

The two responses are identical: { "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }

Additional notes

I also tried setting the decoding mode to top_p and to top_k; the results are still unaffected.
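For reference, a sketch of how the mode was applied, assuming the usual fill_template.py flow from this repo; the model-repo path is a placeholder:

    # Hedged sketch: write decoding_mode into the tensorrt_llm model's
    # config.pbtxt before launching the server. The script and key exist
    # in this repo; the path is a placeholder.
    python3 tools/fill_template.py -i /path/to/llama2-70b/repo/tensorrt_llm/config.pbtxt \
        decoding_mode:top_k_top_p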

yjjiang11 · May 16 '24 01:05