
Can Mistral run on different GPUs (A100 & A5500)?

Open thj08 opened this issue 1 year ago • 6 comments

I followed the instructions and can run Mistral via the llama example on a single GPU (A100 or A5500), but I don't know how to set tp_size=2 / workers=2 for the convert_checkpoint config. I tried 2 GPUs on 1 machine and 2 machines with 1 GPU each; both failed. If you have any suggestions, please let me know. Thanks for the reply.

thj08 avatar Apr 23 '24 13:04 thj08

You could try referring to the guide for the standard llama or other models. To run TP, add --tp_size 2 when converting the checkpoint. If you still encounter errors, please share your scripts and the full log by following the issue template.

byshiue avatar Apr 24 '24 06:04 byshiue
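For reference, a typical 2-way tensor-parallel workflow for the llama/Mistral example looks roughly like this. This is a sketch based on the v0.10-era example scripts; the model and output directories are placeholders, and exact flags may differ between versions:

```shell
# Convert the HF checkpoint into a 2-way tensor-parallel TRT-LLM checkpoint.
python3 convert_checkpoint.py \
    --model_dir ./mistral \
    --output_dir ./tllm_checkpoint \
    --dtype float16 \
    --tp_size 2

# Build the rank engines from the converted checkpoint.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir /tmp/fp16 \
    --gemm_plugin float16

# Launch with one MPI rank per GPU; the world size must equal tp_size * pp_size.
mpirun -n 2 --allow-run-as-root \
    python3 ../run.py \
        --tokenizer_dir ./mistral/ \
        --engine_dir /tmp/fp16 \
        --max_output_len 50
```

Each of the two ranks expects to bind to its own local GPU, which is why both devices must be visible to the launching process.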

System Info

  • CPU architecture - x86_64
  • GPU properties
    • GPU name: NVIDIA A100 memory size: 40G
    • GPU name: NVIDIA A5500 memory size: 24G
  • Libraries
    • TensorRT-LLM branch or tag: v0.10.0
    • TensorRT-LLM commit: dev2024041600
    • Container used: yes
  • NVIDIA driver version: 550.54.15
  • OS: Ubuntu 22.04

Who can help?

@byshiue

Reproduction

python3 convert_checkpoint.py --model_dir ./mistral --tp_size 2

trtllm-build --checkpoint_dir ./tllm_checkpoint --output_dir /tmp/fp16 --gemm_plugin float16 --max_input_len 32256

python3 ../run.py --max_output_len=50 --tokenizer_dir ./mistral/ --engine_dir=/tmp/fp16 --max_attention_window_size=4096

Expected behavior

The run succeeds.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024041600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024041600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
Traceback (most recent call last):
  File "/tmp/k8s/TensorRTLLM/examples/llama/../run.py", line 564, in <module>
    main(args)
  File "/tmp/k8s/TensorRTLLM/examples/llama/../run.py", line 413, in main
    runner = runner_cls.from_dir(*runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 184, in from_dir
    session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/utils/sessionUtils.cpp:34)
1 0x7215a5417031 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x68c031) [0x7215a5417031]
2 0x7215a6e972d7 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 487
3 0x721617805a95 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb8a95) [0x721617805a95]
4 0x7216177b5549 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x68549) [0x7216177b5549]
5 0x721617798737 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x4b737) [0x721617798737]
6 0x5644bebab10e python3(+0x15a10e) [0x5644bebab10e]
7 0x5644beba1a7b _PyObject_MakeTpCall + 603
8 0x5644bebb9acb python3(+0x168acb) [0x5644bebb9acb]
9 0x5644bebba635 _PyObject_Call + 277
10 0x5644bebb6087 python3(+0x165087) [0x5644bebb6087]
11 0x5644beba1e2b python3(+0x150e2b) [0x5644beba1e2b]
12 0x721617797da9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x4ada9) [0x721617797da9]
13 0x5644beba1a7b _PyObject_MakeTpCall + 603
14 0x5644beb9b150 _PyEval_EvalFrameDefault + 30112
15 0x5644bebb97f1 python3(+0x1687f1) [0x5644bebb97f1]
16 0x5644bebba492 PyObject_Call + 290
17 0x5644beb965d7 _PyEval_EvalFrameDefault + 10791
18 0x5644bebab9fc _PyFunction_Vectorcall + 124
19 0x5644beb9426d _PyEval_EvalFrameDefault + 1725
20 0x5644beb909c6 python3(+0x13f9c6) [0x5644beb909c6]
21 0x5644bec86256 PyEval_EvalCode + 134
22 0x5644becb1108 python3(+0x260108) [0x5644becb1108]
23 0x5644becaa9cb python3(+0x2599cb) [0x5644becaa9cb]
24 0x5644becb0e55 python3(+0x25fe55) [0x5644becb0e55]
25 0x5644becb0338 _PyRun_SimpleFileObject + 424
26 0x5644becaff83 _PyRun_AnyFileObject + 67
27 0x5644beca2a5e Py_RunMain + 702
28 0x5644bec7902d Py_BytesMain + 45
29 0x72178dce8d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x72178dce8d90]
30 0x72178dce8e40 __libc_start_main + 128
31 0x5644bec78f25 _start + 37
[TensorRT-LLM][INFO] Loaded engine size: 7050 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7210, GPU 7504 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 7212, GPU 7514 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2

additional notes

none

thj08 avatar Apr 24 '24 12:04 thj08
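The "invalid device ordinal" above means a rank asked CUDA for a GPU its process cannot see: with MPI size 2 on a single node, rank 1 selects local device 1, which fails if only one GPU is exposed to it. The mapping can be sketched roughly as follows (`device_ordinal_for_rank` is a hypothetical helper for illustration, not a TensorRT-LLM API):

```python
def device_ordinal_for_rank(rank: int, gpus_per_node: int, visible_gpus: int) -> int:
    """Mimic the single-node rank -> local GPU mapping used by MPI-based runtimes.

    Each rank selects local device (rank % gpus_per_node); if that ordinal is
    not among the visible devices, cudaSetDevice reports 'invalid device ordinal'.
    """
    ordinal = rank % gpus_per_node
    if ordinal >= visible_gpus:
        raise RuntimeError(
            f"invalid device ordinal: rank {rank} wants GPU {ordinal}, "
            f"but only {visible_gpus} device(s) are visible"
        )
    return ordinal
```

So for the error above, a likely cause is that rank 1's process only had one GPU visible, e.g. because each pod or container exposed a single device.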

You cannot run TP across different GPU models.

byshiue avatar Apr 25 '24 00:04 byshiue

Thanks for the reply. Is there any way to run on different GPUs? The container runtime treats different GPUs as the same resource (e.g. nvidia.com/gpu), so is it possible to run on different GPUs inside a container?

thj08 avatar Apr 25 '24 03:04 thj08

Does the same apply to pp_size? I set pp_size=2; it doesn't show any error message at first, but there is no further progress either:

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024041600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024041600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Loaded engine size: 6923 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7082, GPU 7502 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 7083, GPU 7512 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO Bootstrap : Using eth0:10.44.0.1<0>
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)

Traceback (most recent call last):
  File "/tmp/k8s/TensorRTLLM/examples/llama/../run.py", line 564, in <module>
    main(args)
  File "/tmp/k8s/TensorRTLLM/examples/llama/../run.py", line 413, in main
    runner = runner_cls.from_dir(*runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 184, in from_dir
    session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/utils/sessionUtils.cpp:34)
1 0x704289817031 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x68c031) [0x704289817031]
2 0x70428b2972d7 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 487
3 0x7042fbc05a95 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb8a95) [0x7042fbc05a95]
4 0x7042fbbb5549 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x68549) [0x7042fbbb5549]
5 0x7042fbb98737 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x4b737) [0x7042fbb98737]
6 0x57e82a1eb10e python3(+0x15a10e) [0x57e82a1eb10e]
7 0x57e82a1e1a7b _PyObject_MakeTpCall + 603
8 0x57e82a1f9acb python3(+0x168acb) [0x57e82a1f9acb]
9 0x57e82a1fa635 _PyObject_Call + 277
10 0x57e82a1f6087 python3(+0x165087) [0x57e82a1f6087]
11 0x57e82a1e1e2b python3(+0x150e2b) [0x57e82a1e1e2b]
12 0x7042fbb97da9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x4ada9) [0x7042fbb97da9]
13 0x57e82a1e1a7b _PyObject_MakeTpCall + 603
14 0x57e82a1db150 _PyEval_EvalFrameDefault + 30112
15 0x57e82a1f97f1 python3(+0x1687f1) [0x57e82a1f97f1]
16 0x57e82a1fa492 PyObject_Call + 290
17 0x57e82a1d65d7 _PyEval_EvalFrameDefault + 10791
18 0x57e82a1eb9fc _PyFunction_Vectorcall + 124
19 0x57e82a1d426d _PyEval_EvalFrameDefault + 1725
20 0x57e82a1d09c6 python3(+0x13f9c6) [0x57e82a1d09c6]
21 0x57e82a2c6256 PyEval_EvalCode + 134
22 0x57e82a2f1108 python3(+0x260108) [0x57e82a2f1108]
23 0x57e82a2ea9cb python3(+0x2599cb) [0x57e82a2ea9cb]
24 0x57e82a2f0e55 python3(+0x25fe55) [0x57e82a2f0e55]
25 0x57e82a2f0338 _PyRun_SimpleFileObject + 424
26 0x57e82a2eff83 _PyRun_AnyFileObject + 67
27 0x57e82a2e2a5e Py_RunMain + 702
28 0x57e82a2b902d Py_BytesMain + 45
29 0x70449f0afd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x70449f0afd90]
30 0x70449f0afe40 __libc_start_main + 128
31 0x57e82a2b8f25 _start + 37
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.19.3+cuda12.0
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO P2P plugin IBext
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO NET/IB : No device found.
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO NET/IB : No device found.
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO NET/Socket : Using [0]eth0:10.44.0.1<0>
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO Using non-device net plugin version 0
trt-deployment-6fc4ffd45-pb29j:1499:1499 [0] NCCL INFO Using network Socket

thj08 avatar Apr 25 '24 05:04 thj08

No. For TP or PP, all ranks must currently use the same GPU model.

byshiue avatar Apr 26 '24 05:04 byshiue