TensorRT-LLM
Qwen14B model result for long prompt differs from HF result
System Info
GPU: RTX 8000
Driver version: 525.85.05
CUDA version: 12.0
System: Ubuntu 20.04
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
1. Build the Qwen engine:
python build.py --hf_model_dir ./tmp/Qwen/7B/ \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/
2. Run inference with a long prompt:
python3 ../run.py --input_text "long ............................." \
--max_output_len=50 \
--tokenizer_dir ./tmp/Qwen/7B/ \
--engine_dir=./tmp/Qwen/7B/trt_engines/int8_kv_cache_weight_only/1-gpu
The results are very different from those of HF.
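For reference, here is a minimal sketch of how an HF fp16 greedy-decoding baseline could be produced for the same prompt and compared against the `run.py` output. The checkpoint path and prompt below are placeholders taken from the reproduction steps, and loading Qwen with `trust_remote_code=True` is assumed to be acceptable.

```python
# Hypothetical sketch: generate an HF fp16 greedy baseline for the same prompt.
# Paths and the prompt are placeholders; adjust to match the reproduction above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_model_dir = "./tmp/Qwen/7B/"          # same checkpoint used to build the TRT engine
prompt = "long ....."                    # the long prompt from step 2

tokenizer = AutoTokenizer.from_pretrained(hf_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    hf_model_dir,
    torch_dtype=torch.float16,           # match the fp16 engine precision
    trust_remote_code=True,
).cuda().eval()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,               # same as --max_output_len=50
        do_sample=False,                 # greedy decoding, no sampling noise
    )

# Print only the newly generated tokens for a direct comparison with run.py output.
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```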
Expected behavior
The results should be similar to those of HF.
actual behavior
1
additional notes
I found a related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/836. Looking forward to your team's attention and progress.
Recently I have also found a sizable difference between the Hugging Face outputs and the TensorRT-LLM outputs of LLaMA-2 13B, both in fp16 precision. It is hard to pin down where the difference originates, since many factors are involved: the sampling algorithm, transformer architecture optimizations, paged attention, etc. However, I observe some degradation in output quality in many cases with the greedy-search decoding strategy.
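One way to narrow this down (a sketch under the assumption that the divergence shows up under greedy decoding) is to inspect the HF model's per-step logit margin between the top-1 and top-2 tokens: positions with a near-zero margin are the ones where tiny numerical differences (fp16 accumulation order, fused attention kernels, etc.) can flip the greedy choice and then cascade into a completely different continuation. The model path and prompt below are placeholders.

```python
# Hypothetical diagnostic: report the per-step top-1/top-2 logit margin during
# HF greedy decoding. Tiny margins mark positions where a small numerical
# difference in the TensorRT-LLM path can flip the argmax and the outputs fork.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/llama2-13b-hf"      # placeholder checkpoint path
prompt = "your long prompt here"         # placeholder prompt

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16
).cuda().eval()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,                 # greedy, to match the observation above
        output_scores=True,              # keep per-step logits
        return_dict_in_generate=True,
    )

for step, scores in enumerate(out.scores):
    top2 = torch.topk(scores[0].float(), k=2)
    margin = (top2.values[0] - top2.values[1]).item()
    token = tokenizer.decode(top2.indices[0])
    # Steps with margin close to 0 are the likely fork points between backends.
    print(f"step {step:3d}  token {token!r:<15} top1-top2 margin {margin:.4f}")
```

If the first position where the TensorRT-LLM token disagrees with the HF token also has a near-zero margin, the discrepancy is more likely ordinary fp16 numerical noise than a real bug; a disagreement at a large-margin position would point more toward a kernel or weight-conversion issue.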