
[Bug] Stuck at NCCL initialization when TP>1

Open pingzhili opened this issue 10 months ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

Many thanks for this great work! When using TP>1, the server gets stuck at NCCL initialization:

INFO 02-18 09:33:49 __init__.py:190] Automatically detected platform cuda.
[2025-02-18 09:33:55] server_args=ServerArgs(model_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='127.0.0.1', port=23333, mem_fraction_static=0.87, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=2, stream_interval=1, stream_output=False, random_seed=149565980, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:590: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
  warnings.warn(
INFO 02-18 09:33:59 __init__.py:190] Automatically detected platform cuda.
INFO 02-18 09:33:59 __init__.py:190] Automatically detected platform cuda.
INFO 02-18 09:33:59 __init__.py:190] Automatically detected platform cuda.
[2025-02-18 09:34:04 TP0] Init torch distributed begin.
[2025-02-18 09:34:05 TP1] Init torch distributed begin.
[2025-02-18 09:34:05 TP1] sglang is using nccl==2.21.5
[2025-02-18 09:34:05 TP0] sglang is using nccl==2.21.5
unites4:188:188 [0] NCCL INFO Bootstrap : Using ibp194s0f0:10.2.133.35<0>
unites4:188:188 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
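(For anyone debugging the same hang: a bare two-rank NCCL all-reduce outside sglang is a quick way to tell whether the hang is sglang-specific or a machine-level NCCL/P2P problem. A minimal sketch follows; the file name nccl_smoke.py is just an example.)

# nccl_smoke.py -- hypothetical name; run with: torchrun --nproc_per_node=2 nccl_smoke.py
# If this also hangs at init or all_reduce, the problem is in the NCCL/P2P layer, not sglang.
import os

import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # same backend sglang uses for TP
    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # default op is SUM across ranks
    torch.cuda.synchronize()
    print(f"rank {local_rank}: all_reduce ok, x[0] = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()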

Reproduction

sudo docker run -e NCCL_DEBUG=TRACE --gpus all --shm-size 32g \
    -p 0.0.0.0:23333:23333 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /home/pingzhi/model-checkpoints:/model-checkpoints \
    --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct \
    --tp 2 --enable-p2p-check --trust-remote-code --port 23333
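Since the command passes --enable-p2p-check, it may also be worth checking what CUDA itself reports for peer access between the GPUs; hangs at NCCL init are often correlated with broken or partial P2P paths. A minimal sketch with plain PyTorch, no sglang involved:

# Quick check: does CUDA report peer-to-peer access between each GPU pair?
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'yes' if ok else 'no'}")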

Environment

INFO 02-18 09:40:33 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA RTX 6000 Ada Generation
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 560.35.03
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   NIC0   NIC1   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     NODE   NODE   NODE   SYS    SYS    SYS    SYS    SYS    SYS    0-23,48-71     0               N/A
GPU1    NODE    X     NODE   NODE   SYS    SYS    SYS    SYS    SYS    SYS    0-23,48-71     0               N/A
GPU2    NODE   NODE    X     NODE   SYS    SYS    SYS    SYS    SYS    SYS    0-23,48-71     0               N/A
GPU3    NODE   NODE   NODE    X     SYS    SYS    SYS    SYS    SYS    SYS    0-23,48-71     0               N/A
GPU4    SYS    SYS    SYS    SYS     X     NODE   NODE   NODE   NODE   NODE   24-47,72-95    1               N/A
GPU5    SYS    SYS    SYS    SYS    NODE    X     NODE   NODE   NODE   NODE   24-47,72-95    1               N/A
GPU6    SYS    SYS    SYS    SYS    NODE   NODE    X     NODE   PHB    PHB    24-47,72-95    1               N/A
GPU7    SYS    SYS    SYS    SYS    NODE   NODE   NODE    X     NODE   NODE   24-47,72-95    1               N/A
NIC0    SYS    SYS    SYS    SYS    NODE   NODE   PHB    NODE    X     PIX
NIC1    SYS    SYS    SYS    SYS    NODE   NODE   PHB    NODE   PIX     X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1

ulimit soft: 1048576

pingzhili commented Feb 18 '25

Solved by adding NCCL_P2P_DISABLE=1, but I am still confused and worried about the performance impact. I would greatly appreciate it if someone could kindly help with this. :)
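For reference, the workaround is just the reproduction command above with one extra environment variable (a sketch of what was run):

sudo docker run -e NCCL_DEBUG=TRACE -e NCCL_P2P_DISABLE=1 --gpus all --shm-size 32g \
    -p 0.0.0.0:23333:23333 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /home/pingzhi/model-checkpoints:/model-checkpoints \
    --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct \
    --tp 2 --enable-p2p-check --trust-remote-code --port 23333

NCCL_P2P_DISABLE=1 makes NCCL stage transfers through shared/host memory instead of direct GPU-to-GPU copies, which is where the performance worry comes from; the actual impact depends on the workload.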

pingzhili commented Feb 18 '25

So what's your current issue? Have you evaluated it on any benchmark, since you mentioned performance? You are welcome to report any accuracy issues and we will take a look later.
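If it helps, a quick way to put a number on TP=2 throughput with and without NCCL_P2P_DISABLE=1 is sglang's bundled serving benchmark (sketch; assumes this version ships the sglang.bench_serving module):

# Run against the already-launched server on port 23333.
python3 -m sglang.bench_serving --backend sglang --port 23333 --num-prompts 200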

jhinpan commented Feb 18 '25

> Solved by adding NCCL_P2P_DISABLE=1, but I am still confused and worried about the performance impact. I would greatly appreciate it if someone could kindly help with this. :)

Set NCCL_DEBUG=TRACE and check the log, or try setting NCCL_IB_GID_INDEX according to your environment.
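For example (the GID index value below is illustrative; on RoCE fabrics it has to match what the NIC actually exposes, which the show_gids tool can list):

# Sketch: relaunch with NCCL tracing on; optionally pin the IB GID index.
export NCCL_DEBUG=TRACE
export NCCL_IB_GID_INDEX=3   # illustrative; index 3 is often RoCE v2, verify with show_gids
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --tp 2 --port 23333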

whybeyoung commented Feb 19 '25

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] commented Apr 21 '25