
Qwen14B model result for a long prompt is different from the HF result

Open Lzhang-hub opened this issue 1 year ago • 1 comment

System Info

GPU: RTX 8000
Driver version: 525.85.05
CUDA version: 12.0
System: Ubuntu 20.04

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

1. Build the Qwen engine:

python build.py --hf_model_dir ./tmp/Qwen/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/

2. Run inference:

python3 ../run.py --input_text "long ............................." \
                  --max_output_len=50 \
                  --tokenizer_dir ./tmp/Qwen/7B/ \
                  --engine_dir=./tmp/Qwen/7B/trt_engines/int8_kv_cache_weight_only/1-gpu

The results are very different from the Hugging Face (HF) results.
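
To rule out sampling randomness as the cause, the HF side can be pinned to greedy decoding with a short script. Below is a minimal sketch (not part of the original report), assuming `transformers` and `accelerate` are installed and the checkpoint sits at the `./tmp/Qwen/7B/` path used above:

```python
# Minimal sketch of an HF greedy-decoding baseline for comparison.
# Assumptions: transformers + accelerate installed; ./tmp/Qwen/7B/ holds
# the same checkpoint that was used to build the TensorRT-LLM engine.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./tmp/Qwen/7B/"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

prompt = "long ............................."  # the elided long prompt from the report
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=False forces greedy decoding, so any remaining difference
# comes from the runtimes themselves, not from the decoding strategy.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```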

Expected behavior

The results should be similar to the HF results.

Actual behavior

1

Additional notes

I found a related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/836. Looking forward to your team's attention and progress.

Lzhang-hub · Jan 25 '24 03:01

Recently I have also found a considerable difference between the Hugging Face outputs and the TensorRT-LLM outputs of LLaMA 2 13B, both in fp16 precision. It is hard to pinpoint where the difference originates: many factors are involved, including the sampling algorithm, transformer architecture optimizations, and paged attention. However, I observe some degradation in output quality in many cases even with the greedy-search decoding strategy.
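
One practical way to localize the divergence, rather than eyeballing the decoded strings, is to compare the generated token-id sequences from the two runtimes: under greedy decoding, the index of the first differing token shows how early the runtimes disagree, since fp16 kernel differences often flip a single token and the sequences cascade apart from there. A rough sketch of such a check (my own illustration, not an official tool):

```python
# Rough sketch: find where two greedy-decoded token-id sequences diverge.
# hf_ids / trt_ids are hypothetical lists of generated token ids collected
# from the Hugging Face run and the TensorRT-LLM run, respectively.
def first_divergence(hf_ids, trt_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(hf_ids, trt_ids)):
        if a != b:
            return i
    if len(hf_ids) != len(trt_ids):  # one sequence is a prefix of the other
        return min(len(hf_ids), len(trt_ids))
    return None

# Example with hypothetical ids:
print(first_divergence([11, 22, 33, 44], [11, 22, 35, 44]))  # -> 2
```

A divergence at the very first generated token suggests numerical differences in the forward pass itself; a late divergence suggests accumulated fp16 error or KV-cache handling.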

kisseternity · Jan 31 '24 09:01