
Bad quality in answers (repetition, non-stop generation, ...) when using Llama3.1-8B-Instruct and Triton

Open alvaroalfaro612 opened this issue 1 year ago • 3 comments

System Info

  • Running on containers on Linux server with GPU A5000 (24GB)

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Create the checkpoint from the HF model: `python3 test/TensorRT-LLM-12/examples/llama/convert_checkpoint.py --model_dir test/Meta-Llama-3.1-8B-Instruct/ --output_dir test/meta-chkpt --dtype bfloat16`
  2. Build the engine: `trtllm-build --checkpoint_dir test/meta-chkpt/ --output_dir test/llama-3.1-engine/ --use_fused_mlp --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --context_fmha enable --max_seq_len 12288`
  3. Load the engine as an ensemble model (preprocessing, postprocessing, ensemble, and tensorrt_llm)
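A minimal sketch of the request body this setup would be queried with, assuming Triton's HTTP generate endpoint (`/v2/models/ensemble/generate` on the default port 8000) and the `ensemble` model name. The `end_id` value 128009 (`<|eot_id|>`, Llama 3.1's end-of-turn token) is an addition not present in the original request; without it the server only stops on the base EOS token, which an instruct model rarely emits:

```python
import json

# Hypothetical helper (not part of the repro): builds a request body for
# Triton's HTTP generate endpoint. 128009 is <|eot_id|>, Llama 3.1's
# end-of-turn token; setting it lets generation stop after the answer.
def build_payload(question: str, max_tokens: int = 50) -> str:
    body = {
        "text_input": question,
        "max_tokens": max_tokens,
        "end_id": 128009,              # <|eot_id|> token id
        "stop_words": ["<|eot_id|>"],  # textual stop as a belt-and-braces
    }
    return json.dumps(body)

# Example: POST this string to http://localhost:8000/v2/models/ensemble/generate
payload = build_payload("What is the capital of France?")
```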

Expected behavior

The model provides accurate answers to the questions.

Actual behavior

The model echoes the question in the answer, keeps generating tokens without stopping, and is repetitive. Example request: `{ "text_input": "Q: What is the capital of France?. Answer:", "parameters": { "max_tokens": 50, "bad_words": [""], "stop_words": [""] } }`

"text_output": "Q: What is the capital of France?. Answer: Paris.\nQ: What is the capital of Australia?. Answer: Canberra.\nQ: What is the capital of China?. Answer: Beijing.\nQ: What is the capital of India?. Answer: New Delhi.\nQ: What is the capital of Japan"

Additional notes

I have tried different dtypes (bfloat16 and float16) when building the engine, but the same problem occurs.
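A bare `Q: ... Answer:` prompt puts an instruct-tuned model into base-model completion mode, which matches the runaway Q&A continuation shown above. One likely fix is to wrap the input in the model's chat template before sending it to the ensemble. A hand-rolled sketch of Llama 3.1's format follows; in practice the template should come from `tokenizer.apply_chat_template` in `transformers` so it exactly matches the model:

```python
# Sketch of the Llama 3.1-Instruct chat format, written out by hand.
# The special-token layout below is an assumption of this sketch; prefer
# AutoTokenizer.from_pretrained(...).apply_chat_template(...) in real use.
def format_llama31_prompt(user_message: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_llama31_prompt("What is the capital of France?")
```

With the prompt framed this way, the model answers as the assistant turn and emits `<|eot_id|>` when done, instead of continuing a Q&A list.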

alvaroalfaro612 avatar Sep 25 '24 08:09 alvaroalfaro612