TensorRT-LLM
Qwen14B model result for long prompt differs from HF result
System Info
GPU: RTX 8000
Driver version: 525.85.05
CUDA version: 12.0
System: Ubuntu 20.04
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
1. Build the Qwen engine:
python build.py --hf_model_dir ./tmp/Qwen/7B/ \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/
2. Run inference with a long prompt:
python3 ../run.py --input_text "long ............................." \
--max_output_len=50 \
--tokenizer_dir ./tmp/Qwen/7B/ \
--engine_dir=./tmp/Qwen/7B/trt_engines/int8_kv_cache_weight_only/1-gpu
The results are very different from those of HF.
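For reference, here is a minimal sketch of how an HF fp16 greedy-decoding baseline could be produced for the same prompt and compared against the `run.py` output. The checkpoint path and prompt below are placeholders taken from the reproduction steps, and loading Qwen with `trust_remote_code=True` is assumed to be acceptable.

```python
# Hypothetical sketch: generate an HF fp16 greedy baseline for the same prompt.
# Paths and the prompt are placeholders; adjust to match the reproduction above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_model_dir = "./tmp/Qwen/7B/"          # same checkpoint used to build the TRT engine
prompt = "long ....."                    # the long prompt from step 2

tokenizer = AutoTokenizer.from_pretrained(hf_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    hf_model_dir,
    torch_dtype=torch.float16,           # match the fp16 engine precision
    trust_remote_code=True,
).cuda().eval()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,               # same as --max_output_len=50
        do_sample=False,                 # greedy decoding, no sampling noise
    )

# Print only the newly generated tokens for a direct comparison with run.py output.
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```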
Expected behavior
The results should be similar to those of HF.
actual behavior
1
additional notes
I found a related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/836. Looking forward to your team's attention and progress.
Recently I have also found a sizable difference between the Hugging Face outputs and the TensorRT-LLM outputs of LLaMA-2 13B, both in fp16 precision. It is hard to pin down where the difference originates, since many factors are involved: the sampling algorithm, transformer architecture optimizations, paged attention, etc. However, I observe some degradation in output quality in many cases with the greedy-search decoding strategy.
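One way to narrow this down (a sketch under the assumption that the divergence shows up under greedy decoding) is to inspect the HF model's per-step logit margin between the top-1 and top-2 tokens: positions with a near-zero margin are the ones where tiny numerical differences (fp16 accumulation order, fused attention kernels, etc.) can flip the greedy choice and then cascade into a completely different continuation. The model path and prompt below are placeholders.

```python
# Hypothetical diagnostic: report the per-step top-1/top-2 logit margin during
# HF greedy decoding. Tiny margins mark positions where a small numerical
# difference in the TensorRT-LLM path can flip the argmax and the outputs fork.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/llama2-13b-hf"      # placeholder checkpoint path
prompt = "your long prompt here"         # placeholder prompt

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16
).cuda().eval()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,                 # greedy, to match the observation above
        output_scores=True,              # keep per-step logits
        return_dict_in_generate=True,
    )

for step, scores in enumerate(out.scores):
    top2 = torch.topk(scores[0].float(), k=2)
    margin = (top2.values[0] - top2.values[1]).item()
    token = tokenizer.decode(top2.indices[0])
    # Steps with margin close to 0 are the likely fork points between backends.
    print(f"step {step:3d}  token {token!r:<15} top1-top2 margin {margin:.4f}")
```

If the first position where the TensorRT-LLM token disagrees with the HF token also has a near-zero margin, the discrepancy is more likely ordinary fp16 numerical noise than a real bug; a disagreement at a large-margin position would point more toward a kernel or weight-conversion issue.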