
System hangs when I use multiple GPUs

Open yirunwang opened this issue 1 year ago • 2 comments

A single GPU works fine, but the system hangs when I use multiple GPUs. Can someone help solve this? Thanks.

python build.py --model_dir meta-llama/Llama-2-7b-chat-hf \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --output_dir ./tmp/llama/7B/trt_engines/fp16/4-gpu/ \
    --world_size 4 \
    --tp_size 4

mpirun -n 4 --allow-run-as-root \
    python ../summarize.py --test_trt_llm \
    --hf_model_dir meta-llama/Llama-2-7b-chat-hf \
    --data_type fp16 \
    --engine_dir ./tmp/llama/7B/trt_engines/fp16/4-gpu/
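Before digging into TensorRT-LLM itself, a quick way to confirm that the four ranks can actually communicate is to run a bare NCCL all-reduce under the same mpirun launcher. This is a minimal sketch, assuming PyTorch with CUDA is installed in the same environment; nccl_sanity.py is a hypothetical file name, not part of TensorRT-LLM:

# nccl_sanity.py -- hypothetical NCCL sanity check, run with:
#   mpirun -n 4 --allow-run-as-root python nccl_sanity.py
import os
import torch
import torch.distributed as dist

# Open MPI exports these variables for each launched process.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# A single all-reduce (default op is SUM) across all ranks.
x = torch.ones(1, device="cuda") * rank
dist.all_reduce(x)
print(f"rank {rank}/{world_size}: all_reduce -> {x.item()}")
dist.destroy_process_group()

If this small script also hangs, the problem is at the NCCL/driver/topology level rather than in TensorRT-LLM; if it completes, the hang is more likely specific to the engine or runtime.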


yirunwang · Jan 06 '24

Possible solutions: see #149.

BasicCoder · Jan 12 '24

@QiJune I am experiencing a similar problem when running benchmarks. Multi-GPU just hangs after the engine is built and loaded.

mikuts · Feb 01 '24

@QiJune I have the same problem when launching NVIDIA NIM for LLMs with multiple GPUs:

===========================================
== NVIDIA Inference Microservice LLM NIM ==

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here: https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/. A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3. A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2024-06-12 16:55:56,247 [INFO] PyTorch version 2.2.2 available.
2024-06-12 16:55:58,199 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-06-12 16:55:58,199 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-06-12 16:56:00,013 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 06-12 16:56:01.903 api_server.py:489] NIM LLM API version 1.0.0
INFO 06-12 16:56:02.322 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 06-12 16:56:02.322 ngc_profile.py:219] Detected 2 compatible profile(s).
INFO 06-12 16:56:02.322 ngc_injector.py:106] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 3]
INFO 06-12 16:56:02.322 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 3]
INFO 06-12 16:56:02.322 ngc_injector.py:141] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 06-12 16:56:03.340 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 06-12 16:56:03.340 ngc_injector.py:146] Profile metadata: tp: 2
INFO 06-12 16:56:03.341 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 06-12 16:56:03.341 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 06-12 16:56:03.341 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 06-12 16:56:06.917 ngc_injector.py:172] Model workspace is now ready. It took 3.576 seconds
2024-06-12 16:56:10,525 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-12 16:56:12.479 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-t0lfe1_z', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-t0lfe1_z', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 06-12 16:56:12.786 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-12 16:56:18.316 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=8763) INFO 06-12 16:56:18 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 06-12 16:56:22 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=8763) INFO 06-12 16:56:23 selector.py:28] Using FlashAttention backend.
INFO 06-12 16:56:28 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=8763) INFO 06-12 16:56:28 pynccl_utils.py:43] vLLM is using nccl==2.19.3
INFO 06-12 16:56:32.741 utils.py:116] generating GPU P2P access cache for in /opt/nim/.cache/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-12 16:56:39.267 utils.py:130] reading GPU P2P access cache from /opt/nim/.cache/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 06-12 16:56:39 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=8763) INFO 06-12 16:56:39 utils.py:130] reading GPU P2P access cache from /opt/nim/.cache/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
(RayWorkerWrapper pid=8763) WARNING 06-12 16:56:39 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
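The log ends with the "Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed" warning, which points at a peer-to-peer access problem on the host rather than something specific to NIM or TensorRT-LLM. As a minimal sketch for checking that (assuming PyTorch is available inside the container; p2p_check.py is a hypothetical file name), the following prints which GPU pairs actually report P2P access:

# p2p_check.py -- hypothetical diagnostic, not part of NIM or TensorRT-LLM.
import torch

# Print the peer-to-peer access matrix for all visible GPUs.
# If most pairs report "NOT available", the hang is likely a platform-level
# P2P issue (topology, ACS/IOMMU settings) rather than a software bug.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'OK' if ok else 'NOT available'}")

Running nvidia-smi topo -m on the host gives the same information at the interconnect level and can help confirm whether the GPUs are connected in a way that supports P2P.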

yirunwang · Jun 13 '24