NCCL errors while running LLAMA2 70b benchmark shmoo with batch size=128 and input length=2048 on 4 H100 GPUs
System Info
- CPU Arch x86
- 4 H100 GPUs
- using commit 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1
Who can help?
@kaiyux @byshiue
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- python examples/llama/build.py
--remove_input_padding
--enable_context_fmha
--parallel_build
--output_dir /tmp/engines/llama/70b
--dtype float16
--use_gpt_attention_plugin float16
--world_size 4
--tp_size 4
--pp_size 1
--max_batch_size 128
--max_input_len 2048
--max_output_len 2048
--enable_fp8
--fp8_kv_cache
--strongly_typed
--n_layer 80
--n_head 64
--n_kv_head 8
--n_embd 8192
--inter_size 28672
--vocab_size 32000
--n_positions 4096
--hidden_act silu
--ffn_dim_multiplier 1.3
--multiple_of 4096
- mpirun -n 4 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir /tmp/engines/llama/70b --warm_up 1 --batch_size 128 --duration 0 --num_runs 5 --input_output_len 2048,1 done
Expected behavior
I expected a valid performance number, as with the other batch size and input length combinations, but the run failed.
actual behavior
BS: 128, ISL/OSL: 2048,1
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[39943,1],1]
Exit code: 1
additional notes
This only happens for batch=128 and input=2048. Other combinations (like batch=64 and input=2048) work well.
Please set NCCL_DEBUG=INFO, run the tests again, and see if we can get more detailed logs.
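One way to apply this suggestion is to forward the variable to every rank through `mpirun`. The sketch below only assembles the command line from the reproduction step above; it assumes an Open MPI `mpirun` (which the job-abort messages in this report suggest), whose `-x` option exports an environment variable to the launched processes. The helper name `build_debug_run` is hypothetical.

```python
import shlex


def build_debug_run(engine_dir="/tmp/engines/llama/70b",
                    batch_size=128,
                    input_output_len="2048,1",
                    world_size=4):
    """Return the benchmark argv with NCCL_DEBUG=INFO forwarded to all ranks.

    Assumption: Open MPI's mpirun, where `-x NAME=VALUE` exports the
    variable to every launched process.
    """
    return [
        "mpirun", "-n", str(world_size),
        "--allow-run-as-root", "--oversubscribe",
        "-x", "NCCL_DEBUG=INFO",
        "./cpp/build/benchmarks/gptSessionBenchmark",
        "--model", "llama",
        "--engine_dir", engine_dir,
        "--warm_up", "1",
        "--batch_size", str(batch_size),
        "--duration", "0",
        "--num_runs", "5",
        "--input_output_len", input_output_len,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command instead of launching the job.
    print(shlex.join(build_debug_run()))
```

Exporting `NCCL_DEBUG=INFO` in the shell before calling `mpirun` may work as well on a single node, but `-x` makes the forwarding explicit.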
@PerkzZheng Please see my log with NCCL_DEBUG=INFO below:
dc:255754:255754 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc:255754:255754 [1] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc:255756:255756 [3] NCCL INFO comm 0x55e4ee26b070 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255754:255754 [1] NCCL INFO comm 0x563c749eae90 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 43000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255755:255755 [2] NCCL INFO comm 0x55f62569e9d0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 52000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255753:255753 [0] NCCL INFO comm 0x55c6f20b7fc0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1b000 commId 0x7849b58874b511a0 - Init COMPLETE

dc:255756:255756 [3] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255756:255756 [3] NCCL INFO enqueue.cc:1283 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:569 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:945 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:130 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:325 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:406 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255754:255754 [1] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255754:255754 [1] NCCL INFO enqueue.cc:1283 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:569 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:945 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:130 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:325 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:406 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255753:255753 [0] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255753:255753 [0] NCCL INFO enqueue.cc:1283 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:569 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:945 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:130 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:325 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:406 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255755:255755 [2] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255755:255755 [2] NCCL INFO enqueue.cc:1283 -> 3
dc:255755:255755 [2] NCCL INFO enqueue.cc:569 -> 3
dc:255755:255755 [2] NCCL INFO enqueue.cc:945 -> 3
dc:255755:255755 [2] NCCL INFO group.cc:130 -> 3
dc:255755:255755 [2] NCCL INFO group.cc:325 -> 3
dc:255755:255755 [2] NCCL INFO group.cc:406 -> 3
dc:255755:255755 [2] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[47368,1],2]
Exit code: 1
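Since the four ranks interleave their output, it can help to pull out just the `NCCL WARN` lines and group them by the bracketed device index. The following is a minimal sketch that assumes the line layout seen in this log (`host:pid:tid [dev] file:line NCCL WARN message`); the helper name `collect_warnings` is hypothetical and the regular expression is not an official NCCL log grammar.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the log above:
#   host:pid:tid [dev] source.cc:line NCCL WARN message
WARN_RE = re.compile(
    r"^(?P<host>\S+?):(?P<pid>\d+):(?P<tid>\d+) "
    r"\[(?P<dev>\d+)\] (?P<src>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def collect_warnings(log_text):
    """Group NCCL WARN lines by the bracketed device index."""
    warnings = defaultdict(list)
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            dev = int(match.group("dev"))
            warnings[dev].append((match.group("src"), match.group("msg")))
    return dict(warnings)


if __name__ == "__main__":
    sample = "\n".join([
        "dc:255756:255756 [3] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available",
        "dc:255756:255756 [3] NCCL INFO enqueue.cc:1283 -> 3",
        "dc:255754:255754 [1] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available",
    ])
    for dev, items in sorted(collect_warnings(sample).items()):
        for src, msg in items:
            print(f"dev {dev}: {src}: {msg}")
```

Applied to the log above, every rank reports the same warning from `enqueue.cc:1182`, which is the line to quote when asking about this failure.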
Did you set the ALGO explicitly (e.g. via NCCL_ALGO)? And could you try nccl-tests with the same environment?
@PerkzZheng I did not set ALGO explicitly. Could you please provide more info on how to run nccl-tests? Does nccl-tests also support different batch sizes and input lengths?
This error only happens when the batch size is equal to or greater than 128 and the input length is 2048, so I don't think it is a general NCCL issue.
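The threshold described here can be put in numbers. The sketch below is only a back-of-the-envelope check, not a confirmed diagnosis: it assumes that each all-reduce covers one fp16 activation tensor of shape `[batch_size * input_len, n_embd]` (plausible with `--remove_input_padding` and `--n_embd 8192`, but not verified in this thread), and it compares the element count against the signed 32-bit limit. The function names are hypothetical.

```python
INT32_MAX = 2**31 - 1


def allreduce_elements(batch_size, input_len, hidden_size=8192):
    """Element count of one [batch_size * input_len, hidden_size] tensor.

    Assumption: this is the tensor handed to a single all-reduce call.
    """
    return batch_size * input_len * hidden_size


def summarize(batch_size, input_len, hidden_size=8192, bytes_per_elem=2):
    """Report element count, byte size (fp16 by default) and int32 fit."""
    elements = allreduce_elements(batch_size, input_len, hidden_size)
    return {
        "elements": elements,
        "bytes": elements * bytes_per_elem,
        "fits_int32": elements <= INT32_MAX,
    }


if __name__ == "__main__":
    for batch_size in (64, 128):
        print(batch_size, summarize(batch_size, 2048))
```

Under these assumptions, batch size 64 gives 2^30 elements (2 GiB in fp16), while batch size 128 gives 2^31 elements (4 GiB), one more than a signed 32-bit integer can hold. That would be consistent with the failure appearing only at batch size 128 and above, but it remains a hypothesis to check. As for nccl-tests, it sweeps message sizes in bytes rather than batch size or input length, so the byte count above is the quantity that would be reproduced there.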
@anchorbob Thanks. We have received similar reports from other users. We are tracking this issue and will keep you posted if we find a solution.
@PerkzZheng, is there any update on this issue? Thanks.