
Llama model does not work on multi-GPU

Open jacob-crux opened this issue 1 year ago • 9 comments

System Info

  • GPU: NVIDIA A100 80GB x 4

  • Container: Triton Inference Server 23.12

  • Package versions: tensorrt 9.2.0.post12.dev5, tensorrt-llm 0.8.0.dev2024013000, nvidia-ammo 0.7.1

Who can help?

@byshiue @Tracin

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Install with pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
  2. Following the README in examples/llama, convert the checkpoint and then run trtllm-build.
  3. Run on 4 GPUs.
  • conversion script
python3 convert_checkpoint.py \
--model_dir ~/ckpt/llama-2-70b-hf \
--output_dir ~/convert_ckpt/llama_2_70b_4gpu_tp4 \
--dtype float16 \
--tp_size 4
  • trtllm-build script
trtllm-build \
--checkpoint_dir ~/convert_ckpt/llama_2_70b_4gpu_tp4 \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_input_len 4096 \
--output_dir ~/trtllm_ckpt/llama_2_70b_fp16_4gpu_tp4
  • run script
mpirun -n 4 \
python3 ../run.py \
--max_output_len 32 \
--max_input_length 2048 \
--input_file ~/data/pg64317.txt \
--engine_dir ~/trtllm_ckpt/llama_2_70b_fp16_4gpu_tp4 \
--tokenizer_dir ~/tokenizer/llama

Expected behavior

Sentences should be generated normally.

actual behavior

[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33618, GPU 35025 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33618, GPU 35169 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33618, GPU 35177 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33618, GPU 35033 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33618, GPU 35025 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33618, GPU 35169 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33618, GPU 35033 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33618, GPU 35177 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.
Failed, NCCL error Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
Failed, NCCL error Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
Failed, NCCL error Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
Failed, NCCL error Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[7365,1],1]
  Exit code:    1
--------------------------------------------------------------------------
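
As a sanity check on the log above, the k/v cache figures can be reproduced with a little arithmetic. This is a minimal sketch: the layer count, KV head count, and head dimension are assumed Llama-2-70B values and do not appear in the log itself; only the token count and byte total come from the log.

```python
# Assumed Llama-2-70B architecture values (not taken from the log above).
NUM_LAYERS = 80
NUM_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # float16, matching --dtype float16
TP_SIZE = 4           # matching --tp_size 4


def kv_bytes_per_token_per_rank(num_layers, num_kv_heads, head_dim,
                                bytes_per_value, tp_size):
    """Bytes of K and V stored per token on one tensor-parallel rank."""
    kv_heads_per_rank = num_kv_heads // tp_size
    # Factor 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * kv_heads_per_rank * head_dim * bytes_per_value


per_token = kv_bytes_per_token_per_rank(
    NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE, TP_SIZE)

tokens_in_cache = 528640          # "Using 528640 tokens in paged KV cache."
total_bytes = per_token * tokens_in_cache

print(per_token)    # bytes per token per rank
print(total_bytes)  # compare with "Allocate 43306188800 bytes for k/v cache."
```

Under these assumptions the product matches the allocation reported on every rank, which suggests that memory setup completed and that the failure happened afterwards, at the first NCCL collective.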

additional notes

The Llama 7B model runs normally on a single A100 GPU, but does not run on multiple GPUs.

tensorrt-llm 0.8.0.dev2024012302 works with multiple GPUs, but a GPU OOM occurs when quantize.py is run on the Llama 70B model with AWQ.

I confirmed that tp_size and pp_size were added to quantize.py in an update two days ago, and quantization completed in the latest pre-release version (0.8.0.dev2024013000). However, as above, the multi-GPU run failed with an 'unknown result code' NCCL error at /home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allgatherPlugin.cpp:103.

I would like to verify that the Llama 70B model runs on A100 80GB x 2 using AWQ, and would appreciate help with the problem above.
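
For a rough idea of whether the weights alone would fit on two GPUs, the FP16 engine size from the log can be scaled down. This is only a back-of-the-envelope sketch: it assumes AWQ means 4-bit weights, and it ignores quantization scales, any layers left in FP16, activations, and the KV cache, so the real footprint will be higher.

```python
FP16_ENGINE_MIB_PER_RANK = 33276  # "Loaded engine size: 33276 MiB" in the log
FP16_TP_SIZE = 4                  # the FP16 engine was built with --tp_size 4


def estimate_quantized_mib_per_gpu(fp16_mib_per_rank, fp16_tp_size,
                                   target_tp_size, weight_bits=4):
    """Scale FP16 weight memory to a lower bit width and a new TP size."""
    total_fp16_mib = fp16_mib_per_rank * fp16_tp_size
    total_quantized_mib = total_fp16_mib * weight_bits / 16
    return total_quantized_mib / target_tp_size


# Hypothetical target: 4-bit weights split across 2 GPUs.
estimate = estimate_quantized_mib_per_gpu(
    FP16_ENGINE_MIB_PER_RANK, FP16_TP_SIZE, target_tp_size=2)

print(f"approx. weight memory per GPU: {estimate:.0f} MiB")
```

By this estimate the quantized weights would take well under the 80 GB available per GPU, so the open question is the NCCL failure rather than capacity.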

jacob-crux, Feb 02 '24 07:02

Can you set NCCL_DEBUG=INFO and see if we can get more detailed logs?
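
One way to pass the variable to every rank is sketched below. It assumes mpirun is Open MPI, whose -x option exports an environment variable to the launched processes; the helper name build_debug_launch is hypothetical and the script arguments are the ones from the run script in the report.

```python
import os
import shlex


def build_debug_launch(num_ranks, script_args, base_env=None):
    """Return (command, environment) for an mpirun launch with NCCL debug logs."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # documented NCCL variable for verbose logging
    cmd = [
        "mpirun", "-n", str(num_ranks),
        "-x", "NCCL_DEBUG",     # Open MPI: export this variable to all ranks
        "python3", "../run.py", *script_args,
    ]
    return cmd, env


# "~" is expanded here because no shell is involved when using subprocess.
args = [
    "--max_output_len", "32",
    "--engine_dir",
    os.path.expanduser("~/trtllm_ckpt/llama_2_70b_fp16_4gpu_tp4"),
    "--tokenizer_dir", os.path.expanduser("~/tokenizer/llama"),
]
cmd, env = build_debug_launch(4, args)
print(shlex.join(cmd))

# To actually launch (requires mpirun and the built engine):
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```

Setting the variable in the shell before calling mpirun should have the same effect for ranks on a single node.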

PerkzZheng, Feb 04 '24 06:02

Here is the detailed output with NCCL_DEBUG=INFO. Which versions of torch, NCCL, CUDA, and cuDNN should I use to test the main branch?

[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
instance-12463:491480:491480 [0] NCCL INFO Bootstrap : Using eth0:xx.xxx.xxx.xxx<0>
instance-12463:491480:491480 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
instance-12463:491480:491480 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
instance-12463:491480:491480 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
instance-12463:491480:491480 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
instance-12463:491480:491480 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
instance-12463:491480:491480 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
instance-12463:491480:491480 [0] NCCL INFO P2P plugin IBext
instance-12463:491480:491480 [0] NCCL INFO NET/IB : No device found.
instance-12463:491480:491480 [0] NCCL INFO NET/IB : No device found.
instance-12463:491480:491480 [0] NCCL INFO NET/Socket : Using [0]eth0:xx.xxx.xxx.xxx<0>
instance-12463:491480:491480 [0] NCCL INFO Using non-device net plugin version 0
instance-12463:491480:491480 [0] NCCL INFO Using network Socket
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
instance-12463:491481:491481 [1] NCCL INFO cudaDriverVersion 12030
instance-12463:491481:491481 [1] NCCL INFO Bootstrap : Using eth0:xx.xxx.xxx.xxx<0>
instance-12463:491481:491481 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
instance-12463:491481:491481 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
instance-12463:491481:491481 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
instance-12463:491481:491481 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
instance-12463:491481:491481 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
instance-12463:491481:491481 [1] NCCL INFO P2P plugin IBext
instance-12463:491481:491481 [1] NCCL INFO NET/IB : No device found.
instance-12463:491481:491481 [1] NCCL INFO NET/IB : No device found.
instance-12463:491481:491481 [1] NCCL INFO NET/Socket : Using [0]eth0:xx.xxx.xxx.xxx<0>
instance-12463:491481:491481 [1] NCCL INFO Using non-device net plugin version 0
instance-12463:491481:491481 [1] NCCL INFO Using network Socket
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] Loaded engine size: 33276 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
instance-12463:491483:491483 [3] NCCL INFO cudaDriverVersion 12030
instance-12463:491483:491483 [3] NCCL INFO Bootstrap : Using eth0:xx.xxx.xxx.xxx<0>
instance-12463:491483:491483 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
instance-12463:491483:491483 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
instance-12463:491483:491483 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
instance-12463:491483:491483 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
instance-12463:491483:491483 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
instance-12463:491483:491483 [3] NCCL INFO P2P plugin IBext
instance-12463:491483:491483 [3] NCCL INFO NET/IB : No device found.
instance-12463:491483:491483 [3] NCCL INFO NET/IB : No device found.
instance-12463:491483:491483 [3] NCCL INFO NET/Socket : Using [0]eth0:xx.xxx.xxx.xxx<0>
instance-12463:491483:491483 [3] NCCL INFO Using non-device net plugin version 0
instance-12463:491483:491483 [3] NCCL INFO Using network Socket
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33439, GPU 33705 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 33441, GPU 33715 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
instance-12463:491482:491482 [2] NCCL INFO cudaDriverVersion 12030
instance-12463:491482:491482 [2] NCCL INFO Bootstrap : Using eth0:xx.xxx.xxx.xxx<0>
instance-12463:491482:491482 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
instance-12463:491482:491482 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
instance-12463:491482:491482 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
instance-12463:491482:491482 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
instance-12463:491482:491482 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
instance-12463:491482:491482 [2] NCCL INFO P2P plugin IBext
instance-12463:491482:491482 [2] NCCL INFO NET/IB : No device found.
instance-12463:491482:491482 [2] NCCL INFO NET/IB : No device found.
instance-12463:491482:491482 [2] NCCL INFO NET/Socket : Using [0]eth0:xx.xxx.xxx.xxx<0>
instance-12463:491482:491482 [2] NCCL INFO Using non-device net plugin version 0
instance-12463:491482:491482 [2] NCCL INFO Using network Socket
instance-12463:491482:491482 [2] NCCL INFO comm 0x562d516cb330 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 98000 commId 0x4783b36d096bc421 - Init START
instance-12463:491483:491483 [3] NCCL INFO comm 0x55c6ba19ab40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId d0000 commId 0x4783b36d096bc421 - Init START
instance-12463:491480:491480 [0] NCCL INFO comm 0x561c8f5e25f0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId e000 commId 0x4783b36d096bc421 - Init START
instance-12463:491481:491481 [1] NCCL INFO comm 0x55e8e51fe410 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 50000 commId 0x4783b36d096bc421 - Init START
instance-12463:491482:491482 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff,00000000
instance-12463:491482:491482 [2] NCCL INFO NVLS multicast support is not available on dev 2
instance-12463:491481:491481 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
instance-12463:491481:491481 [1] NCCL INFO NVLS multicast support is not available on dev 1
instance-12463:491480:491480 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
instance-12463:491480:491480 [0] NCCL INFO NVLS multicast support is not available on dev 0
instance-12463:491483:491483 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff,00000000
instance-12463:491483:491483 [3] NCCL INFO NVLS multicast support is not available on dev 3
instance-12463:491482:491482 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
instance-12463:491482:491482 [2] NCCL INFO P2P Chunksize set to 524288
instance-12463:491480:491480 [0] NCCL INFO Channel 00/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 01/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 02/24 :    0   1   2   3
instance-12463:491483:491483 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->2 [5] -1/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] -1/-1/-1->3->2 [8] -1/-1/-1->3->2 [9] -1/-1/-1->3->2 [10] -1/-1/-1->3->2 [11] -1/-1/-1->3->2 [12] -1/-1/-1->3->2 [13] -1/-1/-1->3->2 [14] -1/-1/-1->3->2 [15] -1/-1/-1->3->2 [16] -1/-1/-1->3->2 [17] -1/-1/-1->3->2 [18] -1/-1/-1->3->2 [19] -1/-1/-1->3->2 [20] -1/-1/-1->3->2 [21] -1/-1/-1->3->2 [22] -1/-1/-1->3->2 [23] -1/-1/-1->3->2
instance-12463:491483:491483 [3] NCCL INFO P2P Chunksize set to 524288
instance-12463:491480:491480 [0] NCCL INFO Channel 03/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 04/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 05/24 :    0   1   2   3
instance-12463:491481:491481 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
instance-12463:491481:491481 [1] NCCL INFO P2P Chunksize set to 524288
instance-12463:491480:491480 [0] NCCL INFO Channel 06/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 07/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 08/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 09/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 10/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 11/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 12/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 13/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 14/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 15/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 16/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 17/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 18/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 19/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 20/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 21/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 22/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Channel 23/24 :    0   1   2   3
instance-12463:491480:491480 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
instance-12463:491480:491480 [0] NCCL INFO P2P Chunksize set to 524288
instance-12463:491481:491481 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 04/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 05/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 06/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 07/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 08/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 09/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 10/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 11/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 12/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 13/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 14/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 15/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 16/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 17/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 18/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 19/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 20/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 21/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 22/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 23/0 : 3[3] -> 0[0] via P2P/CUMEM/read
instance-12463:491480:491480 [0] NCCL INFO Connected all rings
instance-12463:491483:491483 [3] NCCL INFO Connected all rings
instance-12463:491483:491483 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Connected all rings
instance-12463:491481:491481 [1] NCCL INFO Connected all rings
instance-12463:491483:491483 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491482:491482 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
instance-12463:491481:491481 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12463:491483:491483 [3] NCCL INFO Connected all trees
instance-12463:491483:491483 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
instance-12463:491483:491483 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
instance-12463:491480:491480 [0] NCCL INFO Connected all trees
instance-12463:491480:491480 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
instance-12463:491480:491480 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
instance-12463:491482:491482 [2] NCCL INFO Connected all trees
instance-12463:491482:491482 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
instance-12463:491482:491482 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
instance-12463:491481:491481 [1] NCCL INFO Connected all trees
instance-12463:491481:491481 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
instance-12463:491481:491481 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
instance-12463:491481:491481 [1] NCCL INFO comm 0x55e8e51fe410 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 50000 commId 0x4783b36d096bc421 - Init COMPLETE
instance-12463:491480:491480 [0] NCCL INFO comm 0x561c8f5e25f0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId e000 commId 0x4783b36d096bc421 - Init COMPLETE
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
instance-12463:491482:491482 [2] NCCL INFO comm 0x562d516cb330 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 98000 commId 0x4783b36d096bc421 - Init COMPLETE
instance-12463:491483:491483 [3] NCCL INFO comm 0x55c6ba19ab40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId d0000 commId 0x4783b36d096bc421 - Init COMPLETE
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +33267, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33618, GPU 35169 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33618, GPU 35177 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33617, GPU 35025 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 33618, GPU 35033 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33618, GPU 35169 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 33617, GPU 35025 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 33618, GPU 35177 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 33618, GPU 35033 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 33267 (MiB)
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 43306188800 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.
[TensorRT-LLM][INFO] Using 528640 tokens in paged KV cache.

instance-12463:491483:491483 [3] init.cc:303 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
instance-12463:491483:491483 [3] NCCL INFO enqueue.cc:1605 -> 512
instance-12463:491483:491483 [3] NCCL INFO enqueue.cc:1623 -> 512
Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
instance-12463:491482:492316 [2] NCCL INFO [Service thread] Connection closed by localRank 3
instance-12463:491480:492314 [0] NCCL INFO [Service thread] Connection closed by localRank 3

instance-12463:491482:491482 [2] init.cc:303 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
instance-12463:491482:491482 [2] NCCL INFO enqueue.cc:1605 -> 512
instance-12463:491482:491482 [2] NCCL INFO enqueue.cc:1623 -> 512
Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'

instance-12463:491480:491480 [0] init.cc:303 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
instance-12463:491480:491480 [0] NCCL INFO enqueue.cc:1605 -> 512
instance-12463:491480:491480 [0] NCCL INFO enqueue.cc:1623 -> 512
Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'

instance-12463:491481:491481 [1] init.cc:303 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
instance-12463:491481:491481 [1] NCCL INFO enqueue.cc:1605 -> 512
instance-12463:491481:491481 [1] NCCL INFO enqueue.cc:1623 -> 512
Failed, NCCL error /data/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[57126,1],3]
  Exit code:    1

jacob-crux, Feb 04 '24 08:02

> Can you tell me which versions of torch, NCCL, CUDA, and cuDNN I should use to test the main branch?

It is recommended to use the built container's environment. I will share this with the NCCL team to see if they have any clues.

PerkzZheng, Feb 04 '24 08:02
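To compare a local install against the container, a minimal sketch like the one below can print the versions asked about above. It assumes python3 and PyTorch are present; `print_stack_versions` is a hypothetical helper name, not part of TensorRT-LLM, and the fallback message appears whenever PyTorch cannot be queried.

```shell
# Hypothetical helper: report the torch / CUDA / cuDNN / NCCL versions
# that PyTorch was built with. Falls back to a message if the query fails.
print_stack_versions() {
    python3 - <<'EOF' 2>/dev/null || echo "Could not query PyTorch (not installed, or built without CUDA/NCCL)"
import torch
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())
print("nccl  :", torch.cuda.nccl.version())
EOF
}

print_stack_versions
```

The cuDNN value is worth checking against the "linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2" warning in the log above.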

@khj94 Can you try NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,COLL,NET to get a more detailed log? Thanks.

PerkzZheng, Feb 05 '24 08:02
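One way to apply the NCCL_DEBUG settings suggested above is sketched here. It assumes the launcher is Open MPI, whose `-x` option forwards an environment variable to the ranks. `LOG_FILE` and `nccl_failures` are hypothetical names introduced for illustration. The run command is the one from the issue, left commented out because it needs the 4-GPU engine.

```shell
# Enable the verbose NCCL logging requested in the comment above.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV,COLL,NET

# Hypothetical log path; any writable location works.
LOG_FILE="${LOG_FILE:-nccl_debug.log}"

# Run command from the issue with output captured to LOG_FILE.
# Assumes Open MPI: -x forwards the named variable to every rank.
# mpirun -n 4 -x NCCL_DEBUG -x NCCL_DEBUG_SUBSYS \
#     python3 ../run.py \
#     --max_output_len 32 \
#     --max_input_length 2048 \
#     --input_file ~/data/pg64317.txt \
#     --engine_dir ~/trtllm_ckpt/llama_2_70b_fp16_4gpu_tp4 \
#     --tokenizer_dir ~/tokenizer/llama 2>&1 | tee "$LOG_FILE"

# Hypothetical helper: keep only NCCL warnings and plugin failures,
# dropping the per-channel INFO lines that dominate the log.
nccl_failures() {
    grep -E 'NCCL WARN|Failed, NCCL error' "$1"
}
```

Calling `nccl_failures "$LOG_FILE"` after a run should surface lines such as the "Attempt to use communicator before the previous operation returned ncclSuccess" warning without the surrounding channel setup output.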

I have shared the detailed logs again. Because I cannot build a Docker image in my current environment, I imported the NGC image, installed TensorRT, and built TensorRT-LLM from source.

I confirmed that the pre-release version is also 0.8.0.dev2024013000, so I installed it and verified that it works on multiple GPUs.

[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Engine version 0.8.0.dev2024013000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Loaded engine size: 66041 MiB
[TensorRT-LLM][INFO] Loaded engine size: 66041 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 66211, GPU 66471 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 66212, GPU 66481 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 66211, GPU 66471 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 66212, GPU 66481 (MiB)
instance-12628:17665:17665 [0] NCCL INFO Bootstrap : Using eth0:xx.xxx.xxx.xxx<0>
instance-12628:17665:17665 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
instance-12628:17665:17665 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
instance-12628:17665:17665 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
instance-12628:17665:17665 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
instance-12628:17665:17665 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
instance-12628:17666:17666 [1] NCCL INFO cudaDriverVersion 12030
instance-12628:17666:17666 [1] NCCL INFO Bootstrap : Using eth0:xx.xxx.xxx.xxx<0>
instance-12628:17666:17666 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
instance-12628:17666:17666 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
instance-12628:17666:17666 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
instance-12628:17666:17666 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
instance-12628:17665:17665 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
instance-12628:17665:17665 [0] NCCL INFO P2P plugin IBext
instance-12628:17665:17665 [0] NCCL INFO NET/IB : No device found.
instance-12628:17665:17665 [0] NCCL INFO NET/IB : No device found.
instance-12628:17665:17665 [0] NCCL INFO NET/Socket : Using [0]eth0:xx.xxx.xxx.xxx<0>
instance-12628:17665:17665 [0] NCCL INFO Using non-device net plugin version 0
instance-12628:17665:17665 [0] NCCL INFO Using network Socket
instance-12628:17666:17666 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
instance-12628:17666:17666 [1] NCCL INFO P2P plugin IBext
instance-12628:17666:17666 [1] NCCL INFO NET/IB : No device found.
instance-12628:17666:17666 [1] NCCL INFO NET/IB : No device found.
instance-12628:17666:17666 [1] NCCL INFO NET/Socket : Using [0]eth0:xx.xxx.xxx.xxx<0>
instance-12628:17666:17666 [1] NCCL INFO Using non-device net plugin version 0
instance-12628:17666:17666 [1] NCCL INFO Using network Socket
instance-12628:17665:17665 [0] NCCL INFO comm 0x5646eba11280 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0x18e4fe8089d2455d - Init START
instance-12628:17666:17666 [1] NCCL INFO comm 0x561b8dbdac50 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId cb000 commId 0x18e4fe8089d2455d - Init START
instance-12628:17666:17666 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0'
instance-12628:17666:17666 [1] NCCL INFO Setting affinity for GPU 1 to ff00,00000000,0000ff00,00000000
instance-12628:17665:17665 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0'
instance-12628:17665:17665 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,00000000,ff000000
instance-12628:17665:17665 [0] NCCL INFO Channel 00/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 01/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 02/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 03/24 :    0   1
instance-12628:17666:17666 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
instance-12628:17666:17666 [1] NCCL INFO P2P Chunksize set to 524288
instance-12628:17665:17665 [0] NCCL INFO Channel 04/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 05/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 06/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 07/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 08/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 09/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 10/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 11/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 12/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 13/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 14/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 15/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 16/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 17/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 18/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 19/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 20/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 21/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 22/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Channel 23/24 :    0   1
instance-12628:17665:17665 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
instance-12628:17665:17665 [0] NCCL INFO P2P Chunksize set to 524288
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c004f40
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c004fb8
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 2 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005030
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 3 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0050a8
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 4 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005120
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 5 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005198
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 6 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005210
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 7 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005288
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 8 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005300
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 9 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005378
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 10 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0053f0
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 11 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005468
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 0 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58004f30
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 12 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0054e0
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 1 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58004fa8
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 13 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005558
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 2 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005020
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 14 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0055d0
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 3 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005098
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 15 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005648
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 4 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005110
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 16 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0056c0
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 5 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005188
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 17 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005738
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 6 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005200
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 18 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0057b0
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 7 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005278
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 19 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005828
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 8 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580052f0
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 20 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0058a0
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 9 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005368
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 21 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005918
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 10 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580053e0
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 22 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005990
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 11 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005458
instance-12628:17665:17873 [0] NCCL INFO New proxy recv connection 23 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005a08
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 12 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580054d0
instance-12628:17665:17665 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 24 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005a80
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 13 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005548
instance-12628:17665:17665 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 25 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005af8
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 14 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580055c0
instance-12628:17665:17665 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 26 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005b70
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 15 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005638
instance-12628:17665:17665 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 27 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005be8
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 16 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580056b0
instance-12628:17665:17665 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 28 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005c60
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 17 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005728
instance-12628:17665:17665 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 29 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005cd8
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 18 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580057a0
instance-12628:17665:17665 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 30 from local rank 0, transport 0
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 19 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005818
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005d50
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 20 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005890
instance-12628:17665:17665 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 31 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005dc8
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 21 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005908
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 32 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005e40
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 22 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005980
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 33 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005eb8
instance-12628:17666:17872 [1] NCCL INFO New proxy recv connection 23 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580059f8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 34 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005f30
instance-12628:17666:17666 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 24 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005a70
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 35 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c005fa8
instance-12628:17666:17666 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 25 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005ae8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 36 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006020
instance-12628:17666:17666 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 26 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005b60
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 37 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006098
instance-12628:17666:17666 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 27 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005bd8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 38 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006110
instance-12628:17666:17666 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 28 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005c50
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 39 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006188
instance-12628:17666:17666 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 29 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005cc8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 40 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006200
instance-12628:17666:17666 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 30 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005d40
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 41 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006278
instance-12628:17666:17666 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 31 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005db8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 42 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0062f0
instance-12628:17666:17666 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 32 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005e30
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 43 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006368
instance-12628:17666:17666 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 33 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005ea8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 44 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0063e0
instance-12628:17666:17666 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 34 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005f20
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 45 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006458
instance-12628:17666:17666 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 35 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58005f98
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 46 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0064d0
instance-12628:17666:17666 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17665:17665 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 36 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006010
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 47 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006548
instance-12628:17666:17666 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 37 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006088
instance-12628:17666:17666 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 38 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006100
instance-12628:17666:17666 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 39 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006178
instance-12628:17666:17666 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 40 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580061f0
instance-12628:17666:17666 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 41 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006268
instance-12628:17666:17666 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 42 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580062e0
instance-12628:17666:17666 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 43 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006358
instance-12628:17666:17666 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 44 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580063d0
instance-12628:17666:17666 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 45 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006448
instance-12628:17666:17666 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 46 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580064c0
instance-12628:17666:17666 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 47 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006538
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 48 from local rank 0, transport 0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 48 from local rank 1, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580065b0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0065c0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 49 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006628
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 49 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006638
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 50 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580066a0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 50 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0066b0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 51 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006718
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 51 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006728
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 52 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006790
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 52 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0067a0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 53 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006808
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 53 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006818
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 54 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006880
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 54 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006890
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 55 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006908
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 56 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006980
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 57 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0069f8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 58 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006a70
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 55 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580068f8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 59 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006ae8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 56 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006970
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 60 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006b60
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 57 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580069e8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 61 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006bd8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 58 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006a60
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 62 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006c50
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 59 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006ad8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 63 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006cc8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 60 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006b50
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 64 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006d40
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 61 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006bc8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 65 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006db8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 62 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006c40
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 66 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006e30
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 63 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006cb8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 67 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006ea8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 64 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006d30
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 68 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006f20
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 65 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006da8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 69 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c006f98
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 66 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006e20
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 70 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007010
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 67 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006e98
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 71 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007088
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 68 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006f10
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 72 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007100
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 69 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58006f88
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 73 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007178
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 70 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007000
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 74 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0071f0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 71 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007078
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 75 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007268
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 72 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580070f0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 76 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0072e0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 73 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007168
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 77 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007358
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 74 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580071e0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 78 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0073d0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 75 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007258
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 79 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007448
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 76 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580072d0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 80 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0074c0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 77 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007348
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 81 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007538
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 78 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580073c0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 82 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0075b0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 79 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007438
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 83 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007628
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 80 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580074b0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 84 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0076a0
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 81 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007528
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 85 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007718
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 82 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580075a0
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 86 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007790
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 83 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007618
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 87 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007808
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 84 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007690
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 88 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007880
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 85 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007708
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 89 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0078f8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 86 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007780
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 90 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007970
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 87 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580077f8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 91 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c0079e8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 88 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007870
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 92 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007a60
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 89 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580078e8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 93 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007ad8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 90 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007960
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 94 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007b50
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 91 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b580079d8
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 95 from local rank 1, transport 0
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007bc8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 92 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007a50
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 93 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007ac8
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 94 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007b40
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 95 from local rank 0, transport 0
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007bb8
instance-12628:17666:17666 [1] NCCL INFO Connected all rings
instance-12628:17666:17666 [1] NCCL INFO Connected all trees
instance-12628:17665:17665 [0] NCCL INFO Connected all rings
instance-12628:17666:17666 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
instance-12628:17666:17666 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
instance-12628:17665:17665 [0] NCCL INFO Connected all trees
instance-12628:17665:17665 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
instance-12628:17665:17665 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
instance-12628:17666:17872 [1] NCCL INFO New proxy send connection 96 from local rank 1, transport 2
instance-12628:17666:17666 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2b58007c30
instance-12628:17665:17873 [0] NCCL INFO New proxy send connection 96 from local rank 0, transport 2
instance-12628:17665:17665 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f260c007c40
instance-12628:17666:17666 [1] NCCL INFO comm 0x561b8dbdac50 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId cb000 commId 0x18e4fe8089d2455d - Init COMPLETE
instance-12628:17665:17665 [0] NCCL INFO comm 0x5646eba11280 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0x18e4fe8089d2455d - Init COMPLETE
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +66032, now: CPU 0, GPU 66032 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +66032, now: CPU 0, GPU 66032 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 66388, GPU 67775 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 66388, GPU 67775 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 66388, GPU 67783 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 66388, GPU 67783 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 66032 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 66032 (MiB)
[TensorRT-LLM][INFO] Allocate 12540968960 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 76544 tokens in paged KV cache.
[TensorRT-LLM][INFO] Allocate 12540968960 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 76544 tokens in paged KV cache.

instance-12628:17665:17665 [0] init.cc:303 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
instance-12628:17665:17665 [0] NCCL INFO enqueue.cc:1605 -> 512
instance-12628:17665:17665 [0] NCCL INFO enqueue.cc:1623 -> 512
Failed, NCCL error /home/user/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
instance-12628:17666:17872 [1] NCCL INFO [Service thread] Connection closed by localRank 0
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

instance-12628:17666:17666 [1] init.cc:303 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
instance-12628:17666:17666 [1] NCCL INFO enqueue.cc:1605 -> 512
instance-12628:17666:17666 [1] NCCL INFO enqueue.cc:1623 -> 512
Failed, NCCL error /home/user/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'

jacob-crux avatar Feb 05 '24 17:02 jacob-crux

I am experimenting with SmoothQuant, and an error occurred during checkpoint conversion.

When I download the llama2-7b model from Hugging Face and convert the checkpoint with smooth quantization, the error below appears. The default dtype of the llama2 model is bfloat16, but NumPy does not support the bfloat16 dtype, so a dtype conversion appears to be necessary before quantizing a model trained in bfloat16.

[TensorRT-LLM] TensorRT-LLM version: 0.8.0.dev2024013000
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.65it/s]
/home/user/.local/lib/python3.10/site-packages/datasets/load.py:1429: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
calibrating model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:45<00:00, 11.31it/s]
Traceback (most recent call last):
  File "/data/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1971, in <module>
    main()
  File "/data/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1956, in main
    covert_and_save(rank, convert_args)
  File "/data/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1893, in covert_and_save
    weights = convert_hf_llama(
  File "/data/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1306, in convert_hf_llama
    int8_weights = generate_int8(qkv_weight,
  File "/data/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 319, in generate_int8
    weights = weights.detach().cpu().numpy()
TypeError: Got unsupported ScalarType BFloat16

jacob-crux avatar Feb 05 '24 17:02 jacob-crux

You can try the torch_to_numpy converter here (https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.1/tensorrt_llm/_utils.py#L33-L38) as a workaround (WAR). We will fix that soon. Thanks for reporting this.

PerkzZheng avatar Feb 06 '24 03:02 PerkzZheng
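
For readers following the bfloat16 discussion above: the `TypeError` comes from NumPy having no native bfloat16 dtype. The sketch below is a standard-library illustration only, not the TensorRT-LLM `torch_to_numpy` code, and its function names are hypothetical. It shows that a bfloat16 value is the upper 16 bits of a float32, so widening to float32 before handing data to NumPy loses no information.

```python
import struct


def float32_to_bits(value: float) -> int:
    """Return the 32-bit IEEE 754 pattern of a value stored as float32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def float32_to_bfloat16_bits(value: float) -> int:
    """Keep only the upper 16 bits (sign, 8 exponent bits, 7 mantissa bits).

    This truncates instead of rounding to nearest, which is enough for
    illustration but is not how a framework would necessarily round.
    """
    return float32_to_bits(value) >> 16


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Widen bfloat16 to float32 by appending 16 zero mantissa bits (lossless)."""
    return bits_to_float32((bits16 & 0xFFFF) << 16)


if __name__ == "__main__":
    for x in (1.0, -2.5, 3.140625):
        b = float32_to_bfloat16_bits(x)
        widened = bfloat16_bits_to_float32(b)
        print(f"{x!r}: bfloat16 bits = 0x{b:04x}, widened = {widened!r}")
```

In PyTorch terms, one commonly used alternative with the same effect is to cast the tensor to float32 (for example `weights.float()`) before calling `.numpy()`. Whether that fits the conversion script depends on how the downstream quantization code expects the dtype, which this thread does not specify.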

@khj94 Can you try running with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,COLL,NET and save the log as a txt file? Please make sure the environment variables are set correctly, or we may not see detailed logs. Thanks.

PerkzZheng avatar Feb 06 '24 09:02 PerkzZheng
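
One possible way to make sure the requested variables reach every rank is sketched below. It only builds the environment and the command line; it does not launch anything. The helper names and the log file name are hypothetical, and the `-x` export flag assumes Open MPI is the `mpirun` in use, which the thread does not confirm.

```python
import os
import shlex

# Values requested in the comment above.
NCCL_DEBUG_VARS = {
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,ENV,COLL,NET",
}


def build_debug_env(base_env=None):
    """Return a copy of the environment with the NCCL debug variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(NCCL_DEBUG_VARS)
    return env


def build_mpirun_command(num_ranks, script_args, log_path):
    """Build a shell command that exports the variables and saves the log.

    Assumption: Open MPI's `-x NAME` flag forwards an environment variable
    to all ranks. Other launchers use different options.
    """
    export_flags = []
    for name in NCCL_DEBUG_VARS:
        export_flags += ["-x", name]
    cmd = ["mpirun", "-n", str(num_ranks), *export_flags, "python3", *script_args]
    # Merge stderr into stdout so NCCL messages end up in the saved file.
    return f"{shlex.join(cmd)} 2>&1 | tee {shlex.quote(log_path)}"


if __name__ == "__main__":
    env = build_debug_env()
    print({name: env[name] for name in NCCL_DEBUG_VARS})
    print(build_mpirun_command(2, ["../run.py"], "nccl_log.txt"))
```

The printed command can be run from a shell in which the two variables are already exported, or passed to `subprocess.run(..., shell=True, env=env)`.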

@khj94 Sorry for asking @PerkzZheng to ask again -- my bad, I just realized the second log had the right NCCL_DEBUG_SUBSYS set and I'm not seeing anything because the error happens before we print the debug line I was looking for. I need to look at how NCCL is used and see if something is wrong there (or in NCCL).

sjeaugey avatar Feb 06 '24 09:02 sjeaugey

After trying the torch_to_numpy converter (https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.1/tensorrt_llm/_utils.py#L33-L38) suggested above as a workaround, a new error happened (shown in an attached image).

Hukongtao avatar Feb 20 '24 10:02 Hukongtao

@khj94 Can you try the fix shown here? https://github.com/NVIDIA/TensorRT-LLM/issues/1131#issuecomment-1968641974

PerkzZheng avatar Feb 28 '24 10:02 PerkzZheng