[bug] Couldn't initialize SMDDP on HuggingFace Training Containers

Open rohit901 opened this issue 1 year ago • 1 comments

Checklist

  • [x] I've prepended issue tag with type of change: [bug]
  • [ ] (If applicable) I've attached the script to reproduce the bug
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: I'm using the HuggingFace training container with the following versions:

py_version: py310
pytorch_version: 2.0.0
region: us-east-1
transformers_version: 4.28.1

I'm using an ml.p4d.24xlarge instance (8x A100) with data parallel mode enabled. I launch the job with an accelerate script and the following accelerate SageMaker config (config.yaml):

base_job_name: accelerate-sagemaker-1
compute_environment: AMAZON_SAGEMAKER
debug: false
distributed_type: DATA_PARALLEL
ec2_instance_type: ml.p4d.24xlarge
gpu_ids: all
iam_role_name: xxxx
mixed_precision: fp16
num_machines: 1
profile: default
py_version: py310
pytorch_version: 2.0.0
region: us-east-1
transformers_version: 4.28.1
use_cpu: false
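
For reference, here is a minimal sketch of how I understand these config fields to map onto SageMaker `HuggingFace` estimator arguments. The helper name `estimator_kwargs` is hypothetical, and the mapping is my assumption about what the accelerate launcher does rather than a copy of its code. The IAM role is left out because the config only carries a role name.

```python
def estimator_kwargs(config: dict) -> dict:
    """Translate an accelerate SageMaker config dict into estimator keyword arguments.

    Hypothetical helper: the field mapping below is an assumption for illustration.
    """
    distribution = None
    if config.get("distributed_type") == "DATA_PARALLEL":
        # Requests the SageMaker distributed data parallel library (SMDDP).
        distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

    return {
        "base_job_name": config["base_job_name"],
        "instance_type": config["ec2_instance_type"],
        "instance_count": config["num_machines"],
        "py_version": config["py_version"],
        "pytorch_version": config["pytorch_version"],
        "transformers_version": config["transformers_version"],
        "distribution": distribution,
    }


# Values copied from config.yaml above.
config = {
    "base_job_name": "accelerate-sagemaker-1",
    "distributed_type": "DATA_PARALLEL",
    "ec2_instance_type": "ml.p4d.24xlarge",
    "num_machines": 1,
    "py_version": "py310",
    "pytorch_version": "2.0.0",
    "transformers_version": "4.28.1",
}
kwargs = estimator_kwargs(config)
print(kwargs["distribution"])

# Untested usage idea (needs the sagemaker SDK, AWS credentials, and a role ARN):
# from sagemaker.huggingface import HuggingFace
# estimator = HuggingFace(entry_point="train.py", role="<role-arn>", **kwargs)
```

If that mapping is right, `DATA_PARALLEL` is what causes SMDDP to be imported at startup, which is where the failure below is raised.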

However, my script fails with an NCCL error. Initially I used a different PyTorch version (2.1.0) and hit an error saying smdistributed was not found; I described that issue here: https://github.com/aws/deep-learning-containers/issues/3627#issuecomment-1978089042

Now, with a different container version, I'm getting NCCL errors.
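
The RuntimeError in the logs is raised while importing `smdistributed.dataparallel.torch.torch_smddp` and names `_check_for_nccl_backend` in `distributed_c10d`. A small check like the sketch below, run at the top of the training script, could confirm which PyTorch version is actually installed at runtime and whether that attribute exists. The helper names `installed_version` and `has_attribute` are hypothetical, and I'm assuming the attribute lives in `torch.distributed.distributed_c10d`, based only on the error text.

```python
import importlib
import importlib.metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def has_attribute(module_name: str, attribute: str) -> bool:
    """Report whether `module_name` can be imported and defines a non-None `attribute`."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Treat any import-time failure as "not available".
        return False
    return getattr(module, attribute, None) is not None


if __name__ == "__main__":
    print("torch version:", installed_version("torch"))
    print("accelerate version:", installed_version("accelerate"))
    print(
        "_check_for_nccl_backend present:",
        has_attribute("torch.distributed.distributed_c10d", "_check_for_nccl_backend"),
    )
```

If the reported torch version differs from the 2.0.0 requested in the config (for example because a requirements file upgraded it), that mismatch might explain why SMDDP v1.8.0 cannot find the function it expects.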

Snippets from the logs:

[1,mpirank:0,algo-1]<stdout>:Running smdistributed.dataparallel v1.8.0
[1,mpirank:0,algo-1]<stdout>:SMDDP: Single node mode
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7
[algo-1:00100] 7 more processes have sent help message help-mpi-api.txt / mpi-abort
[algo-1:00100] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
 RuntimeError
 Couldn't initialize SMDDP.
 Expected mechanism for checking for NCCL backend has changed.
 Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
 
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
 return _run_module_code(code, init_globals, run_name,
 File "/opt/conda/lib/python3.10/runpy.py", line 96, in _run_module_code
 _run_code(code, mod_globals, init_globals,
 File "train_vlcm_distill_lcm_wds.py", line 1416, in <module>
 main(args)
 File "train_vlcm_distill_lcm_wds.py", line 780, in main
 accelerator = Accelerator(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 361, in __init__
 self.state = AcceleratorState(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 549, in __init__
 PartialState(cpu, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 95, in __init__
 import smdistributed.dataparallel.torch.torch_smddp  # noqa
 File "/opt/conda/lib/python3.10/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 33, in <module>
 raise RuntimeError("""
 algo-1:133:133 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:133:133 [0] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:133:133 [0] NCCL INFO cudaDriverVersion 12020
 NCCL version 2.19.3+cuda12.3
 algo-1:129:205 [7] NCCL INFO cudaDriverVersion 12020
 algo-1:129:205 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:129:205 [7] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:124:209 [4] NCCL INFO cudaDriverVersion 12020
 algo-1:124:209 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:124:209 [4] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:113:113 [1] NCCL INFO cudaDriverVersion 12020
 algo-1:113:113 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:113:113 [1] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:128:197 [6] NCCL INFO cudaDriverVersion 12020
 algo-1:128:197 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:116:211 [2] NCCL INFO cudaDriverVersion 12020
 algo-1:128:197 [6] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:116:211 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:116:211 [2] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:126:199 [5] NCCL INFO cudaDriverVersion 12020
 algo-1:126:199 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:122:122 [3] NCCL INFO cudaDriverVersion 12020
 algo-1:126:199 [5] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:122:122 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:122:122 [3] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
 algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
 algo-1:133:133 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:133:133 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:133:133 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:133:133 [0] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:133:133 [0] NCCL INFO Using non-device net plugin version 0
 algo-1:133:133 [0] NCCL INFO Using network AWS Libfabric
 algo-1:133:133 [0] NCCL INFO DMA-BUF is available on GPU device 0
 algo-1:116:211 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:116:211 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:116:211 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:113:113 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:113:113 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:113:113 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:116:211 [2] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:116:211 [2] NCCL INFO Using non-device net plugin version 0
 algo-1:116:211 [2] NCCL INFO Using network AWS Libfabric
 algo-1:113:113 [1] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:113:113 [1] NCCL INFO Using non-device net plugin version 0
 algo-1:113:113 [1] NCCL INFO Using network AWS Libfabric
 algo-1:129:205 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:129:205 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:129:205 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:126:199 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:126:199 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:126:199 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:128:197 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:128:197 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:128:197 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:129:205 [7] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:129:205 [7] NCCL INFO Using non-device net plugin version 0
 algo-1:129:205 [7] NCCL INFO Using network AWS Libfabric
 algo-1:126:199 [5] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:126:199 [5] NCCL INFO Using non-device net plugin version 0
 algo-1:126:199 [5] NCCL INFO Using network AWS Libfabric
 algo-1:128:197 [6] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:128:197 [6] NCCL INFO Using non-device net plugin version 0
 algo-1:128:197 [6] NCCL INFO Using network AWS Libfabric
 algo-1:122:122 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:122:122 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:122:122 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:122:122 [3] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:122:122 [3] NCCL INFO Using non-device net plugin version 0
 algo-1:122:122 [3] NCCL INFO Using network AWS Libfabric
 algo-1:124:209 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
 algo-1:124:209 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:124:209 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
 algo-1:124:209 [4] NCCL INFO NET/OFI Selected Provider is efa
 algo-1:124:209 [4] NCCL INFO Using non-device net plugin version 0
 algo-1:124:209 [4] NCCL INFO Using network AWS Libfabric
 algo-1:116:211 [2] NCCL INFO DMA-BUF is available on GPU device 2
 algo-1:113:113 [1] NCCL INFO DMA-BUF is available on GPU device 1
 algo-1:129:205 [7] NCCL INFO DMA-BUF is available on GPU device 7
 algo-1:126:199 [5] NCCL INFO DMA-BUF is available on GPU device 5
 algo-1:128:197 [6] NCCL INFO DMA-BUF is available on GPU device 6
 algo-1:122:122 [3] NCCL INFO DMA-BUF is available on GPU device 3
 algo-1:124:209 [4] NCCL INFO DMA-BUF is available on GPU device 4
 algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init START
 algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init START
 algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init START
 algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init START
 algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init START
 algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init START
 algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init START
 algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init START
 algo-1:129:205 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:128:197 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:116:211 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:122:122 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:124:209 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:133:133 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:126:199 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:113:113 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 algo-1:129:205 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
 algo-1:129:205 [7] NCCL INFO NVLS multicast support is not available on dev 7
 algo-1:126:199 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
 algo-1:126:199 [5] NCCL INFO NVLS multicast support is not available on dev 5
 algo-1:128:197 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
 algo-1:128:197 [6] NCCL INFO NVLS multicast support is not available on dev 6
 algo-1:113:113 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
 algo-1:113:113 [1] NCCL INFO NVLS multicast support is not available on dev 1
 algo-1:116:211 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
 algo-1:116:211 [2] NCCL INFO NVLS multicast support is not available on dev 2
 algo-1:124:209 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
 algo-1:124:209 [4] NCCL INFO NVLS multicast support is not available on dev 4
 algo-1:122:122 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
 algo-1:122:122 [3] NCCL INFO NVLS multicast support is not available on dev 3
 algo-1:133:133 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
 algo-1:133:133 [0] NCCL INFO NVLS multicast support is not available on dev 0
 algo-1:133:133 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 algo-1:113:113 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 algo-1:113:113 [1] NCCL INFO P2P Chunksize set to 524288
 algo-1:116:211 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 algo-1:116:211 [2] NCCL INFO P2P Chunksize set to 524288
 algo-1:122:122 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 algo-1:122:122 [3] NCCL INFO P2P Chunksize set to 524288
 algo-1:133:133 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 algo-1:126:199 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 algo-1:126:199 [5] NCCL INFO P2P Chunksize set to 524288
 algo-1:124:209 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 algo-1:124:209 [4] NCCL INFO P2P Chunksize set to 524288
 algo-1:128:197 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 algo-1:128:197 [6] NCCL INFO P2P Chunksize set to 524288
 algo-1:129:205 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 algo-1:129:205 [7] NCCL INFO P2P Chunksize set to 524288
 algo-1:133:133 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 algo-1:133:133 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 algo-1:133:133 [0] NCCL INFO P2P Chunksize set to 524288
 algo-1:113:113 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Connected all rings
 algo-1:113:113 [1] NCCL INFO Connected all rings
 algo-1:133:133 [0] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Connected all rings
 algo-1:129:205 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Connected all rings
 algo-1:116:211 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:129:205 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:116:211 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:128:197 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:113:113 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:124:209 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:126:199 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:122:122 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM/read
 algo-1:133:133 [0] NCCL INFO Connected all trees
 algo-1:133:133 [0] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:133:133 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:133:133 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:113:113 [1] NCCL INFO Connected all trees
 algo-1:113:113 [1] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:113:113 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:113:113 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:116:211 [2] NCCL INFO Connected all trees
 algo-1:116:211 [2] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:116:211 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:116:211 [2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:129:205 [7] NCCL INFO Connected all trees
 algo-1:129:205 [7] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:129:205 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:129:205 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:122:122 [3] NCCL INFO Connected all trees
 algo-1:122:122 [3] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:122:122 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:122:122 [3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:124:209 [4] NCCL INFO Connected all trees
 algo-1:124:209 [4] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:124:209 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:124:209 [4] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:128:197 [6] NCCL INFO Connected all trees
 algo-1:128:197 [6] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:126:199 [5] NCCL INFO Connected all trees
 algo-1:128:197 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:128:197 [6] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:126:199 [5] NCCL INFO NCCL_PROTO set by environment to simple
 algo-1:126:199 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 algo-1:126:199 [5] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init COMPLETE
 algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init COMPLETE
 Running smdistributed.dataparallel v1.8.0
 SMDDP: Single node mode
 algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
 algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7"
Command "mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/gethostname.cpython-310-x86_64-linux-gnu.so -x NCCL_PROTO=simple -x FI_EFA_USE_DEVICE_RDMA=1 smddprun /opt/conda/bin/python3.10 -m mpi4py train_vlcm_distill_lcm_wds.py --adam_weight_decay 0 --checkpointing_steps 200 --checkpoints_total_limit 10 --dataloader_num_workers 8 --ema_decay 0.95 --enable_xformers_memory_efficient_attention True --gradient_accumulation_steps 1 --gradient_checkpointing True --learning_rate 1e-06 --loss_type huber --max_train_samples 10727607 --max_train_steps 10727607 --mixed_precision fp16 --pretrained_teacher_model damo-vilab/text-to-video-ms-1.7b --resolution 512 --resume_from_checkpoint latest --seed 453645634 --train_batch_size 16 --use_8bit_adam True --validation_steps 200"
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2024-03-05 07:29:19 Uploading - Uploading generated training model
2024-03-05 07:29:19 Failed - Training job failed
Traceback (most recent call last):
  File "/home/rohit.bharadwaj/.conda/envs/LCM/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1021, in launch_command
    sagemaker_launcher(defaults, args)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 840, in sagemaker_launcher
    huggingface_estimator.fit(inputs=sagemaker_inputs)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 1341, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 2677, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 5568, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7711, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7764, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job accelerate-sagemaker-1-2024-03-05-07-15-53-204: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
 RuntimeError
 Couldn't initialize SMDDP.
 Expected mechanism for checking for NCCL backend has changed.
 Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
 
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
 return _run_module_code(code, init_globals, run_name,
 File "/opt/conda/lib/python3.10/runpy.py",

DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04

Current behavior: shown in the logs above

Expected behavior: Model should train on multiple GPUs.

Additional context:

rohit901, Mar 05 '24 07:03

I think this issue is related to PyTorch version 2.2.0 or CUDA 12. One of my dependencies (xformers) was forcing the installation of the latest PyTorch version inside the container, which caused this error. I hope this can be fixed.

rohit901, Mar 11 '24 07:03
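
For anyone hitting the same error, a quick check inside the training container may help confirm whether a dependency has replaced the PyTorch build that the image ships with. The sketch below is only an illustration and not part of SMDDP or the DLC tooling: the helper names (`installed_torch_version`, `has_nccl_check_hook`, `diagnose`) are hypothetical, and `EXPECTED_TORCH` is assumed from the `pytorch_version: 2.0.0` value in the config above. It looks for the `_check_for_nccl_backend` attribute that the error message says was expected in `distributed_c10d`.

```python
import importlib
from typing import Optional

# Assumption: the version advertised by the image tag / config.yaml in this issue.
EXPECTED_TORCH = "2.0.0"


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version without a local build suffix, or None."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    # "2.0.0+cu118" -> "2.0.0"
    return torch.__version__.split("+")[0]


def has_nccl_check_hook(c10d_module) -> bool:
    """True if the module exposes a non-None _check_for_nccl_backend attribute."""
    return getattr(c10d_module, "_check_for_nccl_backend", None) is not None


def diagnose() -> str:
    """Summarize whether the environment matches what the image is expected to ship."""
    version = installed_torch_version()
    if version is None:
        return "torch is not installed"
    if version != EXPECTED_TORCH:
        return f"torch {version} is installed, but the image is expected to ship {EXPECTED_TORCH}"
    try:
        c10d = importlib.import_module("torch.distributed.distributed_c10d")
    except ImportError:
        return "torch.distributed.distributed_c10d could not be imported"
    if not has_nccl_check_hook(c10d):
        return "_check_for_nccl_backend was not found in distributed_c10d"
    return "torch version and _check_for_nccl_backend look as expected"
```

Calling `print(diagnose())` at the top of the training script, before any distributed setup, should show whether the version drifted after the dependencies were installed. If it did, pinning or reinstalling the dependency so that it does not pull in a newer PyTorch is the likely next step.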