
Server gets stuck when using pipeline parallelism across multiple nodes

Open hezeli123 opened this issue 1 year ago • 2 comments

System Info

2 nodes × 4 NVIDIA L40S GPUs, loading Llama 2 70B as a single model (tensorrt_llm). Using image: nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

1. Build the engine:

```
python build.py --model_dir xxx --dtype float16 --remove_input_padding --enable_context_fmha --multi_block_mode --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --use_inflight_batching --output_dir ./engine.inflight.tp4pp2.70b --world_size 8 --tp_size 4 --pp_size 2 --max_input_len 8192 --max_output_len 16384 --vocab_size=49954
```

2. Launch Triton across both nodes with MPI (an example hostfile sketch follows this list):

```
mpirun -np 8 --allow-run-as-root --hostfile myhosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ucx tritonserver --model-repository=xxx --disable-auto-complete-config
```

3. Send a request with the client:

```
python3 inflight_batcher_llm_client.py -u xxx:8001 --text "Hello, how " --tokenizer-dir=70b -S --request-output-len 40
```
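The contents of `myhosts` were not shared; a minimal sketch of what an Open MPI hostfile for this 2-node × 4-rank setup might look like, assuming the hostnames seen in the NCCL log below:

```
# Hypothetical Open MPI hostfile: 2 nodes, 4 ranks each (8 total).
# Hostnames inferred from the NCCL log; adjust to the actual cluster.
dg11-train-prod001-node-10-224-96-171 slots=4
dg11-train-prod001-node-10-224-96-174 slots=4
```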

Expected behavior

The engine loads successfully and inference completes successfully.

Actual behavior

When sending an inference request from the client, the server got stuck: the 4 GPUs on one node sat at 100% utilization while the 4 GPUs on the other node stayed at 0%. Server log:

```
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240227 08:25:18.989490 35588 grpc_server.cc:2495] Started GRPCInferenceService at 0.0.0.0:8001
I20240227 08:25:18.991181 35588 http_server.cc:4997] Started HTTPService at 0.0.0.0:8000
I20240227 08:25:19.032207 35588 http_server.cc:282] Started Metrics Service at 0.0.0.0:8002
dg11-train-prod001-node-10-224-96-171:13983:14088 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [receive] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-171:13983:14088 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [receive] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-171:13985:14089 [2] NCCL INFO Channel 00/1 : 0[2] -> 1[2] [receive] via NET/IBext/2/Shared
dg11-train-prod001-node-10-224-96-171:13985:14089 [2] NCCL INFO Channel 01/1 : 0[2] -> 1[2] [receive] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-171:13984:14090 [1] NCCL INFO Channel 00/1 : 0[1] -> 1[1] [receive] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-171:13984:14090 [1] NCCL INFO Channel 01/1 : 0[1] -> 1[1] [receive] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-171:13986:14091 [3] NCCL INFO Channel 00/1 : 0[3] -> 1[3] [receive] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-171:13986:14091 [3] NCCL INFO Channel 01/1 : 0[3] -> 1[3] [receive] via NET/IBext/2/Shared
dg11-train-prod001-node-10-224-96-174:35588:35724 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [send] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-174:35588:35724 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [send] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-174:35589:35725 [1] NCCL INFO Channel 00/1 : 0[1] -> 1[1] [send] via NET/IBext/1/Shared
dg11-train-prod001-node-10-224-96-174:35589:35725 [1] NCCL INFO Channel 01/1 : 0[1] -> 1[1] [send] via NET/IBext/0/Shared
dg11-train-prod001-node-10-224-96-174:35590:35726 [2] NCCL INFO Channel 00/1 : 0[2] -> 1[2] [send] via NET/IBext/2/Shared
dg11-train-prod001-node-10-224-96-174:35590:35726 [2] NCCL INFO Channel 01/1 : 0[2] -> 1[2] [send] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-174:35591:35727 [3] NCCL INFO Channel 00/1 : 0[3] -> 1[3] [send] via NET/IBext/3/Shared
dg11-train-prod001-node-10-224-96-174:35591:35727 [3] NCCL INFO Channel 01/1 : 0[3] -> 1[3] [send] via NET/IBext/2/Shared
```
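A quick way to reproduce the utilization measurement above is to poll `nvidia-smi` on both nodes while a request is in flight; a minimal sketch, assuming SSH access and the hostnames from the NCCL log:

```
# Poll per-GPU utilization on each node while the request hangs.
# Hostnames taken from the NCCL log; adjust to the actual cluster.
for host in dg11-train-prod001-node-10-224-96-171 dg11-train-prod001-node-10-224-96-174; do
  echo "=== $host ==="
  ssh "$host" nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader
done
```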

Additional notes

When pipeline parallelism is not used (tensor parallelism only, tp=8), the server works correctly. A sketch of that working build is below.
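For comparison, a sketch of the working tensor-parallel-only build, assuming the same flags as the failing reproduction with only the parallelism arguments and output directory changed (the exact command was not shared):

```
# Assumed working configuration: tp=8, no pipeline parallelism.
# Output directory name is hypothetical; other flags mirror the failing build.
python build.py --model_dir xxx --dtype float16 --remove_input_padding \
  --enable_context_fmha --multi_block_mode --use_gemm_plugin float16 \
  --use_gpt_attention_plugin float16 --use_inflight_batching \
  --output_dir ./engine.inflight.tp8.70b --world_size 8 --tp_size 8 --pp_size 1 \
  --max_input_len 8192 --max_output_len 16384 --vocab_size=49954
```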

hezeli123 · Feb 28 '24