TensorRT-LLM
Repeated output from Llama 3 when max_output_len is large
Hello everyone, I have a problem and would like to ask for help. After I build the engine and run the inference script run.py, if I set max_output_len to a small value, the output is truncated before it is complete; that much I understand. But why, if I set it to a large value such as 1000, does the model keep repeating its output, interspersed with markers like <|begin_of_text|>, until it reaches 1000 tokens? What causes this? Is there a parameter I have not set correctly? The command I run is:
```bash
python3 examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
    --max_output_len 500 \
    --tokenizer_dir /TensorRT-LLM/Meta-Llama-3-8B-Instruct \
    --input_text "What is machine learning, only output 20 words"
```
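In case it is relevant: my input_text is the raw question, without the Llama 3 chat template applied. Below is a minimal sketch (assuming the Hugging Face transformers tokenizer for Meta-Llama-3-8B-Instruct; run.py may already do something equivalent internally) of how I would apply the template and look up the id of the <|eot_id|> token, which as far as I know the Instruct model uses to end its turn:

```python
# Minimal sketch, assuming the Hugging Face `transformers` library is installed
# and the tokenizer files are at the path used in the command above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/TensorRT-LLM/Meta-Llama-3-8B-Instruct")

# Wrap the raw question in the Llama 3 chat template so the model can tell
# where the user turn ends and the assistant turn begins.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is machine learning, only output 20 words"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)

# The Instruct model signals the end of its answer with <|eot_id|>,
# not <|end_of_text|>; generation should stop when this id is produced.
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
print(eot_id)  # 128009 for Meta-Llama-3-8B-Instruct
```

If run.py does not treat <|eot_id|> as the stop token by default, could that explain why generation never stops on its own and instead runs on, repeating, until max_output_len is reached?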