
Issue: "Device 0 peer access Device x is not available."

geraldstanje opened this issue on Apr 22, 2024 · 2 comments

System Info

  • gpu:

Mon Apr 22 17:00:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   17C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  • gpu topo:
nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       PHB     PHB     PHB     0-47            0               N/A
GPU1    PHB     X       PHB     PHB     0-47            0               N/A
GPU2    PHB     PHB     X       PHB     0-47            0               N/A
GPU3    PHB     PHB     PHB     X       0-47            0               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
  • tensorrtllm_backend: 0.8.0
  • model: Llama-2-7b-chat-hf (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
  • docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
  • run docker image:
    sudo docker run -it --ipc=host --gpus all --ulimit memlock=-1 --shm-size="2g" \
        nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash

Who can help?

@kaiyux @byshiue

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. run the docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
  2. install TensorRT-LLM v0.8.0 (see the sketch below)
  3. compile the model
  4. run the model
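
A minimal sketch of steps 1 and 2, assuming the pip wheel route; the version pin and the NVIDIA pip index URL are assumptions based on the TensorRT-LLM 0.8.0 install docs, not taken from this thread:

# start the Triton container (same command as in System Info above)
sudo docker run -it --ipc=host --gpus all --ulimit memlock=-1 --shm-size="2g" \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash
# inside the container, install the matching TensorRT-LLM release (assumed)
pip3 install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com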

Expected behavior

All GPUs should be visible and usable to compile and run the model.

Actual behavior

Why are the other 3 GPUs not available?

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.

logs:

./llama2_llm_tensorrt_engine_build_and_test.sh
[TensorRT-LLM] TensorRT-LLM version: 0.8.00.8.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.36s/it]
Weights loaded. Total time: 00:00:10
Total time of converting checkpoints: 00:02:05
[TensorRT-LLM] TensorRT-LLM version: 0.8.0[04/22/2024-16:40:34] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-16:40:34] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-16:40:34] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 183, GPU 256 (MiB)
[04/22/2024-16:41:24] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +312, now: CPU 2117, GPU 568 (MiB)
[04/22/2024-16:41:24] [TRT-LLM] [I] Set nccl_plugin to None.
[04/22/2024-16:41:24] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:41:25] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/22/2024-16:41:25] [TRT] [W] Unused Input: position_ids
[04/22/2024-16:41:25] [TRT] [W] Detected layernorm nodes in FP16.
[04/22/2024-16:41:25] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/22/2024-16:41:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2153, GPU 594 (MiB)
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2155, GPU 604 (MiB)
[04/22/2024-16:41:25] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/22/2024-16:41:35] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/22/2024-16:41:35] [TRT] [I] Detected 106 inputs and 1 output network tensors.
[04/22/2024-16:41:40] [TRT] [I] Total Host Persistent Memory: 82640
[04/22/2024-16:41:40] [TRT] [I] Total Device Persistent Memory: 0
[04/22/2024-16:41:40] [TRT] [I] Total Scratch Memory: 537001984
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 619 steps to complete.
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 24.3962ms to assign 12 blocks to 619 nodes requiring 3238006272 bytes.
[04/22/2024-16:41:40] [TRT] [I] Total Activation Memory: 3238006272
[04/22/2024-16:41:40] [TRT] [I] Total Weights Memory: 13476831232
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2192, GPU 13474 (MiB)
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2193, GPU 13484 (MiB)
[04/22/2024-16:41:40] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:40] [TRT] [I] Engine generation completed in 15.4387 seconds.
[04/22/2024-16:41:40] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 12853 MiB
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +12853, now: CPU 0, GPU 12853 (MiB)
[04/22/2024-16:41:47] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 28514 MiB
[04/22/2024-16:41:47] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:22
[04/22/2024-16:41:48] [TRT-LLM] [I] Serializing engine to /tensorrt/tensorrt-models/Llama-2-7b-chat-hf/v0.8.0/trt-engines/fp16/1-gpu/rank0.engine...
[04/22/2024-16:42:09] [TRT-LLM] [I] Engine serialized. Total time: 00:00:21
[04/22/2024-16:42:10] [TRT-LLM] [I] Total time of building all engines: 00:01:36
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 12855 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13001, GPU 13130 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13002, GPU 13140 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed maxSequenceLength. Therefore, it has been adjusted to match the value of maxSequenceLength.
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13035, GPU 16242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13035, GPU 16250 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Allocate 5972688896 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 11392 tokens in paged KV cache.
[TensorRT-LLM] TensorRT-LLM version: 0.8.0Input [Text 0]: "<s> [INST] What is deep learning? [/INST]"
Output [Text 0 Beam 0]: " Deep learning is a subfield of machine learning that involves the use of artificial neural networks to model and solve complex problems. Here are some key things to know about deep learning:

1. Artificial Neural Networks (ANNs): Deep learning algorithms are based on artificial neural networks, which are modeled after the structure and function of the human brain. ANNs consist of interconnected nodes or neurons that process inputs and produce outputs.
2. Multi-Layer Perceptron (MLP): The most common type of deep learning algorithm is the multi-layer perceptron (MLP), which consists of multiple layers of neurons with nonlinear activation functions. Each layer processes the output from the previous layer, allowing the network to learn increasingly complex patterns in the data.
3. Convolutional Neural Networks (CNNs): CNNs are a type of deep learning algorithm specifically designed for image recognition tasks. They use convolutional and pooling layers to extract features from images, followed by fully connected layers to make predictions.
4. Recurrent Neural Networks (RNNs): RNNs are a type of deep learning algorithm used for sequential data, such as"

llama2_llm_tensorrt_engine_build_and_test.sh looks like this:

#!/bin/bash

HF_MODEL_NAME="Llama-2-7b-chat-hf"
HF_MODEL_PATH="meta-llama/Llama-2-7b-chat-hf"
# Clone the Hugging Face model repository
# ...
# Convert the model checkpoint to TensorRT format
python /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /tensorrt/models/$HF_MODEL_NAME \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --dtype float16
# Build TensorRT engine
trtllm-build --checkpoint_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --max_input_len 32768 \
    --strongly_typed
# Run inference with the TensorRT engine
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --input_text="[INST] What is deep learning? [/INST]"

Also, one thing I noticed when measuring the latency of run.py: it takes 21 seconds to run. Why is it so slow?

time python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --input_text="[INST] What is deep learning? [/INST]"

...

real    0m21.735s
user    0m11.898s
sys     0m14.218s

additional notes

Here is how I installed TensorRT-LLM: https://medium.com/trendyol-tech/deploying-a-large-language-model-llm-with-tensorrt-llm-on-triton-inference-server-a-step-by-step-d53fccc856fa

geraldstanje commented on Apr 22, 2024

For the warning

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.

it is expected given your communication topology: your GPUs are connected via PHB (see the nvidia-smi topo -m output above), so direct peer access is not available. If your topology were PXB, you would not see the warning.
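
For what it's worth, recent nvidia-smi versions can print the peer-to-peer capability matrix directly, which makes missing peer access visible without running TensorRT-LLM (assuming the -p2p flag exists in your driver's nvidia-smi):

# show the P2P read-capability matrix between all GPU pairs
nvidia-smi topo -p2p r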

For the timing: if you measure with time, the result includes loading the model, allocating buffers at startup, and a lot of environment initialization and finalization. You could add --run_profiling to see the real inference time.
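
For example, a sketch based on the run.py command from this thread, with only the profiling flag added:

python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --run_profiling \
    --input_text="[INST] What is deep learning? [/INST]"

This should report the steady-state generation latency, excluding the one-time engine load and setup cost that time measures.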

byshiue commented on Apr 24, 2024

@byshiue are you sure that is only a warning and that it will still work with both 1 and 4 GPUs?

geraldstanje commented on Apr 27, 2024

With such a topology, you need to disable use_custom_all_reduce.
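
For example, a sketch of passing the flag at engine build time (flag syntax as used later in this thread; the checkpoint and engine paths are placeholders):

# rebuild the engine with the custom all-reduce plugin disabled
trtllm-build --checkpoint_dir <multi-gpu-checkpoint-dir> \
    --output_dir <engine-output-dir> \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable

With it disabled, TensorRT-LLM should fall back to NCCL for cross-GPU reductions, which does not require direct peer access.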

byshiue commented on May 9, 2024

Hi @byshiue,

I'm facing the very same error with the same machine config: an AWS g5.12xlarge instance with 4x A10G GPUs.

Built with the command from the llama example, using the Llama 3 8B model:

trtllm-build --checkpoint_dir ./tllm_checkpoint_2_2_gpu_awq/ \
    --output_dir ./tmp/llama/8B/trt_engines/awq/4-gpu-0-4-ard \
    --gemm_plugin float16 \
    --pp_size 4 \
    --use_custom_all_reduce disable

Tried with both configs: [tp=1, pp=4] and [tp=2, pp=4].

mpirun --allow-run-as-root -n 4 python3 examples/run.py \
    --engine_dir=./tmp/llama/8B/trt_engines/awq/4-gpu-0-4/ \
    --max_output_len 100 \
    --tokenizer_dir ./Meta-Llama-3-8B-Instruct \
    --input_text "How do I count to nine in French?"

ed777e817fde:3947:3947 [2] NCCL INFO init.cc:1641 -> 2
ed777e817fde:3946:3946 [1] NCCL INFO Channel 00/02 : 0 1
ed777e817fde:3946:3946 [1] NCCL INFO Channel 01/02 : 0 1
ed777e817fde:3946:3946 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ed777e817fde:3946:3946 [1] NCCL INFO P2P Chunksize set to 131072
ed777e817fde:3947:3947 [2] NCCL INFO init.cc:1679 -> 2
Failed, NCCL error /home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/recvPlugin.cpp:132 'unhandled system error (run with NCCL_DEBUG=INFO for details)'

ed777e817fde:3948:3948 [3] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-Z0gfZ7 to 9637892 bytes

ed777e817fde:3948:3948 [3] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-Z0gfZ7 (size 9637888)
ed777e817fde:3948:3948 [3] NCCL INFO transport/shm.cc:114 -> 2
ed777e817fde:3948:3948 [3] NCCL INFO transport.cc:33 -> 2
ed777e817fde:3948:3948 [3] NCCL INFO transport.cc:97 -> 2
ed777e817fde:3948:3948 [3] NCCL INFO init.cc:1117 -> 2
ed777e817fde:3948:3948 [3] NCCL INFO init.cc:1396 -> 2
ed777e817fde:3948:3948 [3] NCCL INFO init.cc:1641 -> 2

ed777e817fde:3946:3946 [1] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-UWHrQM to 9637892 bytes

ed777e817fde:3946:3946 [1] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-UWHrQM (size 9637888)
ed777e817fde:3946:3946 [1] NCCL INFO transport/shm.cc:114 -> 2
ed777e817fde:3946:3946 [1] NCCL INFO transport.cc:33 -> 2
ed777e817fde:3946:3946 [1] NCCL INFO transport.cc:97 -> 2
ed777e817fde:3946:3946 [1] NCCL INFO init.cc:1117 -> 2
ed777e817fde:3946:3946 [1] NCCL INFO init.cc:1396 -> 2
ed777e817fde:3946:3946 [1] NCCL INFO init.cc:1641 -> 2
ed777e817fde:3948:3948 [3] NCCL INFO init.cc:1679 -> 2
Failed, NCCL error /home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/recvPlugin.cpp:132 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
ed777e817fde:3946:3946 [1] NCCL INFO init.cc:1679 -> 2
Failed, NCCL error /home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/sendPlugin.cpp:135 'unhandled system error (run with NCCL_DEBUG=INFO for details)'


Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

Is it just another OOM error?

Maybe I am missing something here. What is the best config for running inference with TRT-LLM for the Llama 3 8B model? Any suggestions?
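
One thing worth checking before concluding OOM: the shmutils.cc warnings above say NCCL could not grow its segment in /dev/shm, which usually points at the container's shared-memory limit being too small for 4 ranks rather than at GPU memory. A hedged sketch of a relaunch (the size is an assumption; alternatively --ipc=host makes the container share the host's /dev/shm, in which case --shm-size is unnecessary):

# give NCCL enough shared memory for 4-rank communication
sudo docker run -it --gpus all --ulimit memlock=-1 --shm-size="16g" \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash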

manickavela29 commented on Aug 2, 2024