
C++ runner outputs wrong results when using LoRA + tensor parallelism

Open · ShuaiShao93 opened this issue 10 months ago · 1 comment

System Info

x86_64, debian 11, A100 GPUs

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

On a VM with 2 A100 GPUs:

  1. pip3 install tensorrt_llm==0.16.0 --extra-index-url https://pypi.nvidia.com/
  2. git clone -b v0.16.0 https://github.com/NVIDIA/TensorRT-LLM.git
  3. git clone https://huggingface.co/unsloth/Llama-3.2-3B-Instruct
  4. git clone https://huggingface.co/ss-galileo/llama-3.2-3B-lora
  5. Run the commands below to build the engine (a config sanity check is sketched after this list):
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./Llama-3.2-3B-Instruct --output_dir ./tllm_3b_checkpoint_2gpu_fp16 --dtype float16 --tp_size=2

trtllm-build --checkpoint_dir ./tllm_3b_checkpoint_2gpu_fp16 --output_dir ./tmp/llama/3B/trt_engines/fp16/2-gpu  --gpt_attention_plugin auto  --gemm_plugin auto --max_num_tokens 128000 --max_batch_size 8 --logits_dtype=float32 --gather_generation_logits --kv_cache_type=paged --lora_plugin auto  --lora_dir llama-3.2-3B-lora/
  6. Run the model with the C++ runner (a programmatic equivalent of both runner paths is sketched after this list):
mpirun -n 2 python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/3B/trt_engines/fp16/2-gpu --max_output_len 100 --max_input_length=100000 --tokenizer_dir ./Llama-3.2-3B-Instruct --input_text "is sky blue?" --lora_dir llama-3.2-3B-lora/ --lora_task_uids 0
  7. Got these results:
Input [Text 0]: "<|begin_of_text|>is sky blue?"
Output [Text 0 Beam 0]: " 1.0. 1.0. 2. 1.0. 3. 1. 4. 1. 5. 1. 6. 1. 7. 1. 8. 1. 9. 1. 10. 1. 11. 1. 12. 1. 13. 1. 14. 1. 15. 1. 16. "
  8. Run the model with the Python runner:
mpirun -n 2 python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/3B/trt_engines/fp16/2-gpu --max_output_len 100 --max_input_length=100000 --tokenizer_dir ./Llama-3.2-3B-Instruct --input_text "is sky blue?" --lora_dir llama-3.2-3B-lora/ --lora_task_uids 0 --use_py_session
  9. Got these results:
Input [Text 0]: "<|begin_of_text|>is sky blue?"
Output [Text 0 Beam 0]: " (a) yes (b) sky is not blue
The question is not about the color of the sky, but about the color of the sky at a particular time of day. The sky appears blue during the daytime, but it can appear different colors at sunrise and sunset. So, the correct answer is (b) sky is not blue.
This question requires the ability to analyze the situation and understand the context, which is a key aspect of critical thinking. It also requires the ability to distinguish"
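For reference, the build settings can be verified before running. Below is a minimal sanity check, assuming the config.json layout that trtllm-build writes into the engine directory in v0.16 (the key paths are an assumption, not a documented API):

```python
import json

# Sketch: trtllm-build writes a config.json into the engine directory.
# Verify the LoRA plugin and tensor parallelism were baked into the build.
# The key paths below are an assumption based on v0.16 engine configs.
with open("./tmp/llama/3B/trt_engines/fp16/2-gpu/config.json") as f:
    cfg = json.load(f)

print("lora_plugin:", cfg["build_config"]["plugin_config"]["lora_plugin"])
print("tp_size:", cfg["pretrained_config"]["mapping"]["tp_size"])
```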
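And here is a minimal sketch of the two code paths that examples/run.py takes in v0.16: without --use_py_session it drives ModelRunnerCpp (the C++ session), with the flag it drives ModelRunner (the Python session). The argument names follow examples/run.py in that release and should be read as assumptions rather than a stable API; launch it under mpirun -n 2 as above:

```python
# Compare the C++ session (ModelRunnerCpp) against the Python session
# (ModelRunner) on the same prompt and LoRA adapter.
# Run as: mpirun -n 2 python3 compare_runners.py
from transformers import AutoTokenizer
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner, ModelRunnerCpp

ENGINE_DIR = "./tmp/llama/3B/trt_engines/fp16/2-gpu"
LORA_DIR = "llama-3.2-3B-lora/"

rank = tensorrt_llm.mpi_rank()
tokenizer = AutoTokenizer.from_pretrained("./Llama-3.2-3B-Instruct")
batch_input_ids = [tokenizer("is sky blue?", return_tensors="pt").input_ids[0]]
end_id = tokenizer.eos_token_id

for cls in (ModelRunnerCpp, ModelRunner):
    runner = cls.from_dir(engine_dir=ENGINE_DIR, lora_dir=LORA_DIR, rank=rank)
    # lora_uids selects the adapter; "0" matches --lora_task_uids 0 above.
    outputs = runner.generate(
        batch_input_ids,
        lora_uids=["0"],
        max_new_tokens=100,
        end_id=end_id,
        pad_id=end_id,
    )
    if rank == 0:
        # Default return is a (batch, beams, seq_len) tensor of token ids,
        # with the input tokens included at the front.
        print(cls.__name__, "->",
              tokenizer.decode(outputs[0][0], skip_special_tokens=True))
    del runner  # free the session before building the next one
```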

Expected behavior

The Python runner and the C++ runner should give the same results.

Actual behavior

The Python runner and the C++ runner give completely different results, and the output from the C++ runner is clearly wrong.

Additional notes

N/A

ShuaiShao93 avatar Dec 28 '24 00:12 ShuaiShao93

I hit the same problem: the C++ runner gives different output than the Python runner, but I don't know why.

HPUedCSLearner avatar Mar 28 '25 09:03 HPUedCSLearner