
How can I resolve errors when loading the qwen1.5-7B (two GPUs) and llama3-8B (two GPUs) models simultaneously with tritonserver?

ChengShuting opened this issue 8 months ago

System Info

env: NVIDIA-SMI 550.54.15, Driver Version 550.54.15, CUDA Version 12.4, 8x V100 16GB
docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

command: mpirun -n 2 --allow-run-as-root tritonserver --model-control-mode=explicit --model-repository=/data/multi_model_repo/ --load-model=Qwen1.5-7B-Chat --load-model=Llama3-8B-Chinese-Chat
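To confirm both models actually reach the READY state after startup, a minimal sketch like the following can be used. It assumes the tritonclient Python package is installed, the server exposes HTTP on the default port 8000, and the model names match the --load-model flags above.

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Server-level liveness and readiness checks.
print("live:", client.is_server_live())
print("ready:", client.is_server_ready())

# Per-model readiness for the two models loaded explicitly at startup.
for model in ("Qwen1.5-7B-Chat", "Llama3-8B-Chinese-Chat"):
    print(model, "ready:", client.is_model_ready(model))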

Expected behavior

(screenshot of the expected output attached as an image)

actual behavior

Both models load successfully, but when I call the Qwen1.5-7B-Chat model through the OpenAI-compatible interface, an error occurs.
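One way to narrow this down is to bypass the OpenAI-compatible frontend and call the Triton model directly, which shows whether the failure is in the backend or in the interface layer. This is only a sketch: the input and output tensor names (text_input, max_tokens, text_output) assume a standard tensorrt_llm ensemble config and should be adjusted to match the actual config.pbtxt.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Prompt as a BYTES tensor; shape may differ depending on the config.pbtxt.
text = httpclient.InferInput("text_input", [1, 1], "BYTES")
text.set_data_from_numpy(np.array([["What is Triton Inference Server?"]], dtype=object))

# Maximum number of tokens to generate.
max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

result = client.infer("Qwen1.5-7B-Chat", inputs=[text, max_tokens])
print(result.as_numpy("text_output"))

If this direct call succeeds, the problem likely lies in the OpenAI-compatible wrapper rather than in loading the two multi-GPU models themselves.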

additional notes

None.

ChengShuting, Jun 21 '24