deep-learning-containers
[bug] Couldn't initialize SMDDP on HuggingFace Training Containers
Checklist
- [x] I've prepended issue tag with type of change: [bug]
- [ ] (If applicable) I've attached the script to reproduce the bug
- [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [ ] (If applicable) I've documented below the tests I've run on the DLC image
- [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
- [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)
Concise Description: I'm using the following HuggingFace training container from here:
py_version: py310
pytorch_version: 2.0.0
region: us-east-1
transformers_version: 4.28.1
I'm using an ml.p4d.24xlarge instance (8x A100) with data parallel mode enabled. I'm launching the job with an accelerate script and the following accelerate SageMaker config (config.yaml):
base_job_name: accelerate-sagemaker-1
compute_environment: AMAZON_SAGEMAKER
debug: false
distributed_type: DATA_PARALLEL
ec2_instance_type: ml.p4d.24xlarge
gpu_ids: all
iam_role_name: xxxx
mixed_precision: fp16
num_machines: 1
profile: default
py_version: py310
pytorch_version: 2.0.0
region: us-east-1
transformers_version: 4.28.1
use_cpu: false
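For reference, the config above is flat key/value YAML; a minimal stdlib-only sketch (no PyYAML dependency, `parse_flat_yaml` is a hypothetical helper) of reading such a file and confirming that SMDDP data parallelism would be requested:

```python
def parse_flat_yaml(text):
    """Parse flat `key: value` lines (no nesting), as in the accelerate config above."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        cfg[key.strip()] = value.strip()
    return cfg

CONFIG = """\
compute_environment: AMAZON_SAGEMAKER
distributed_type: DATA_PARALLEL
ec2_instance_type: ml.p4d.24xlarge
num_machines: 1
"""

cfg = parse_flat_yaml(CONFIG)
# DATA_PARALLEL on AMAZON_SAGEMAKER is the combination that makes accelerate
# take the smdistributed (SMDDP) import path seen in the traceback below.
assert cfg["distributed_type"] == "DATA_PARALLEL"
```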
However, my script fails with an NCCL error. Initially I used a different PyTorch version (2.1.0) and hit an error saying smdistributed was not found; I described that issue here: https://github.com/aws/deep-learning-containers/issues/3627#issuecomment-1978089042
Now, using this different container version, I'm getting NCCL errors instead.
Snippets from the logs:
[1,mpirank:0,algo-1]<stdout>:Running smdistributed.dataparallel v1.8.0
[1,mpirank:0,algo-1]<stdout>:SMDDP: Single node mode
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
[1,mpirank:6,algo-1]<stdout>:algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7
[algo-1:00100] 7 more processes have sent help message help-mpi-api.txt / mpi-abort
[algo-1:00100] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2024-03-05 07:28:26,193 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR Reporting training FAILURE
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
RuntimeError
Couldn't initialize SMDDP.
Expected mechanism for checking for NCCL backend has changed.
Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
Traceback (most recent call last)
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
run_command_line(args)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
run_path(sys.argv[0], run_name='__main__')
File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/conda/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "train_vlcm_distill_lcm_wds.py", line 1416, in <module>
main(args)
File "train_vlcm_distill_lcm_wds.py", line 780, in main
accelerator = Accelerator(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 361, in __init__
self.state = AcceleratorState(
File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 549, in __init__
PartialState(cpu, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 95, in __init__
import smdistributed.dataparallel.torch.torch_smddp # noqa
File "/opt/conda/lib/python3.10/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 33, in <module>
raise RuntimeError("""
algo-1:133:133 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:133:133 [0] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:133:133 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:133:133 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:133:133 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
algo-1:129:205 [7] NCCL INFO cudaDriverVersion 12020
algo-1:129:205 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:129:205 [7] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:129:205 [7] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:129:205 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:124:209 [4] NCCL INFO cudaDriverVersion 12020
algo-1:124:209 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:124:209 [4] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:113:113 [1] NCCL INFO cudaDriverVersion 12020
algo-1:113:113 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:124:209 [4] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:124:209 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:113:113 [1] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:128:197 [6] NCCL INFO cudaDriverVersion 12020
algo-1:128:197 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:116:211 [2] NCCL INFO cudaDriverVersion 12020
algo-1:128:197 [6] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:116:211 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:116:211 [2] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:113:113 [1] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:113:113 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:126:199 [5] NCCL INFO cudaDriverVersion 12020
algo-1:126:199 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:122:122 [3] NCCL INFO cudaDriverVersion 12020
algo-1:126:199 [5] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:122:122 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:128:197 [6] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:128:197 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:116:211 [2] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:116:211 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:122:122 [3] NCCL INFO Bootstrap : Using eth0:10.0.224.233<0>
algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:126:199 [5] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:126:199 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
algo-1:122:122 [3] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
algo-1:122:122 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
algo-1:133:133 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:133:133 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:133:133 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:133:133 [0] NCCL INFO NET/OFI Selected Provider is efa
algo-1:133:133 [0] NCCL INFO Using non-device net plugin version 0
algo-1:133:133 [0] NCCL INFO Using network AWS Libfabric
algo-1:133:133 [0] NCCL INFO DMA-BUF is available on GPU device 0
algo-1:116:211 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:116:211 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:116:211 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:113:113 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:113:113 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:113:113 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:116:211 [2] NCCL INFO NET/OFI Selected Provider is efa
algo-1:116:211 [2] NCCL INFO Using non-device net plugin version 0
algo-1:116:211 [2] NCCL INFO Using network AWS Libfabric
algo-1:113:113 [1] NCCL INFO NET/OFI Selected Provider is efa
algo-1:113:113 [1] NCCL INFO Using non-device net plugin version 0
algo-1:113:113 [1] NCCL INFO Using network AWS Libfabric
algo-1:129:205 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:129:205 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:129:205 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:126:199 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:126:199 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:126:199 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:128:197 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:128:197 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:128:197 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:129:205 [7] NCCL INFO NET/OFI Selected Provider is efa
algo-1:129:205 [7] NCCL INFO Using non-device net plugin version 0
algo-1:129:205 [7] NCCL INFO Using network AWS Libfabric
algo-1:126:199 [5] NCCL INFO NET/OFI Selected Provider is efa
algo-1:126:199 [5] NCCL INFO Using non-device net plugin version 0
algo-1:126:199 [5] NCCL INFO Using network AWS Libfabric
algo-1:128:197 [6] NCCL INFO NET/OFI Selected Provider is efa
algo-1:128:197 [6] NCCL INFO Using non-device net plugin version 0
algo-1:128:197 [6] NCCL INFO Using network AWS Libfabric
algo-1:122:122 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:122:122 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:122:122 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:122:122 [3] NCCL INFO NET/OFI Selected Provider is efa
algo-1:122:122 [3] NCCL INFO Using non-device net plugin version 0
algo-1:122:122 [3] NCCL INFO Using network AWS Libfabric
algo-1:124:209 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
algo-1:124:209 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:124:209 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:124:209 [4] NCCL INFO NET/OFI Selected Provider is efa
algo-1:124:209 [4] NCCL INFO Using non-device net plugin version 0
algo-1:124:209 [4] NCCL INFO Using network AWS Libfabric
algo-1:116:211 [2] NCCL INFO DMA-BUF is available on GPU device 2
algo-1:113:113 [1] NCCL INFO DMA-BUF is available on GPU device 1
algo-1:129:205 [7] NCCL INFO DMA-BUF is available on GPU device 7
algo-1:126:199 [5] NCCL INFO DMA-BUF is available on GPU device 5
algo-1:128:197 [6] NCCL INFO DMA-BUF is available on GPU device 6
algo-1:122:122 [3] NCCL INFO DMA-BUF is available on GPU device 3
algo-1:124:209 [4] NCCL INFO DMA-BUF is available on GPU device 4
algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init START
algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init START
algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init START
algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init START
algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init START
algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init START
algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init START
algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init START
algo-1:129:205 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:128:197 [6] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:116:211 [2] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:122:122 [3] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:124:209 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:133:133 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:126:199 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:113:113 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/conda/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
algo-1:129:205 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
algo-1:129:205 [7] NCCL INFO NVLS multicast support is not available on dev 7
algo-1:126:199 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
algo-1:126:199 [5] NCCL INFO NVLS multicast support is not available on dev 5
algo-1:128:197 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
algo-1:128:197 [6] NCCL INFO NVLS multicast support is not available on dev 6
algo-1:113:113 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
algo-1:113:113 [1] NCCL INFO NVLS multicast support is not available on dev 1
algo-1:116:211 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
algo-1:116:211 [2] NCCL INFO NVLS multicast support is not available on dev 2
algo-1:124:209 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
algo-1:124:209 [4] NCCL INFO NVLS multicast support is not available on dev 4
algo-1:122:122 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
algo-1:122:122 [3] NCCL INFO NVLS multicast support is not available on dev 3
algo-1:133:133 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
algo-1:133:133 [0] NCCL INFO NVLS multicast support is not available on dev 0
algo-1:133:133 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
algo-1:113:113 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
algo-1:113:113 [1] NCCL INFO P2P Chunksize set to 524288
algo-1:116:211 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
algo-1:116:211 [2] NCCL INFO P2P Chunksize set to 524288
algo-1:122:122 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
algo-1:122:122 [3] NCCL INFO P2P Chunksize set to 524288
algo-1:133:133 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
algo-1:126:199 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
algo-1:126:199 [5] NCCL INFO P2P Chunksize set to 524288
algo-1:124:209 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
algo-1:124:209 [4] NCCL INFO P2P Chunksize set to 524288
algo-1:128:197 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
algo-1:128:197 [6] NCCL INFO P2P Chunksize set to 524288
algo-1:129:205 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
algo-1:129:205 [7] NCCL INFO P2P Chunksize set to 524288
algo-1:133:133 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
algo-1:133:133 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
algo-1:133:133 [0] NCCL INFO P2P Chunksize set to 524288
algo-1:113:113 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:113:113 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:122:122 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:129:205 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
algo-1:124:209 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
algo-1:116:211 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
algo-1:128:197 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
[... identical "Channel NN/0 : X[X] -> Y[Y] via P2P/CUMEM/read" lines for channels 00-23 on every GPU pair, truncated ...]
algo-1:116:211 [2] NCCL INFO Connected all rings
algo-1:113:113 [1] NCCL INFO Connected all rings
algo-1:133:133 [0] NCCL INFO Connected all rings
algo-1:129:205 [7] NCCL INFO Connected all rings
algo-1:128:197 [6] NCCL INFO Connected all rings
algo-1:126:199 [5] NCCL INFO Connected all rings
algo-1:124:209 [4] NCCL INFO Connected all rings
algo-1:122:122 [3] NCCL INFO Connected all rings
[... reverse-direction "Channel NN/0 : Y[Y] -> X[X] via P2P/CUMEM/read" lines, truncated ...]
algo-1:133:133 [0] NCCL INFO Connected all trees
algo-1:133:133 [0] NCCL INFO NCCL_PROTO set by environment to simple
algo-1:133:133 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
algo-1:133:133 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[... equivalent "Connected all trees" / NCCL_PROTO / threadThresholds lines for ranks 1-7, truncated ...]
algo-1:122:122 [3] NCCL INFO comm 0x564794a485a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 201d0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:116:211 [2] NCCL INFO comm 0x7f69d8f30e80 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 201c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:113:113 [1] NCCL INFO comm 0x563ffd80ffc0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 101d0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:129:205 [7] NCCL INFO comm 0x7f7d94f30600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a01d0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:133:133 [0] NCCL INFO comm 0x55bc7e222630 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 101c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:124:209 [4] NCCL INFO comm 0x7fa600f314c0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 901c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:128:197 [6] NCCL INFO comm 0x7f7a78f313a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a01c0 commId 0x3421509e5ba66223 - Init COMPLETE
algo-1:126:199 [5] NCCL INFO comm 0x7f29ecf30790 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 901d0 commId 0x3421509e5ba66223 - Init COMPLETE
Running smdistributed.dataparallel v1.8.0
SMDDP: Single node mode
algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 5
algo-1:128:237 [6] NCCL INFO [Service thread] Connection closed by localRank 7
Command "mpirun --host algo-1 -np 8 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_SINGLENODE=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/gethostname.cpython-310-x86_64-linux-gnu.so -x NCCL_PROTO=simple -x FI_EFA_USE_DEVICE_RDMA=1 smddprun /opt/conda/bin/python3.10 -m mpi4py train_vlcm_distill_lcm_wds.py --adam_weight_decay 0 --checkpointing_steps 200 --checkpoints_total_limit 10 --dataloader_num_workers 8 --ema_decay 0.95 --enable_xformers_memory_efficient_attention True --gradient_accumulation_steps 1 --gradient_checkpointing True --learning_rate 1e-06 --loss_type huber --max_train_samples 10727607 --max_train_steps 10727607 --mixed_precision fp16 --pretrained_teacher_model damo-vilab/text-to-video-ms-1.7b --resolution 512 --resume_from_checkpoint latest --seed 453645634 --train_batch_size 16 --use_8bit_adam True --validation_steps 200"
2024-03-05 07:28:26,194 sagemaker-training-toolkit ERROR Encountered exit_code 1
2024-03-05 07:29:19 Uploading - Uploading generated training model
2024-03-05 07:29:19 Failed - Training job failed
Traceback (most recent call last):
File "/home/rohit.bharadwaj/.conda/envs/LCM/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1021, in launch_command
sagemaker_launcher(defaults, args)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/accelerate/commands/launch.py", line 840, in sagemaker_launcher
huggingface_estimator.fit(inputs=sagemaker_inputs)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
return run_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 1341, in fit
self.latest_training_job.wait(logs=logs)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/estimator.py", line 2677, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 5568, in logs_for_job
_logs_for_job(self, job_name, wait, poll, log_type, timeout)
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7711, in _logs_for_job
_check_job_status(job_name, description, "TrainingJobStatus")
File "/home/rohit.bharadwaj/.conda/envs/LCM/lib/python3.11/site-packages/sagemaker/session.py", line 7764, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job accelerate-sagemaker-1-2024-03-05-07-15-53-204: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise RuntimeError("""
RuntimeError
Couldn't initialize SMDDP.
Expected mechanism for checking for NCCL backend has changed.
Expected defintion for _check_for_nccl_backend in distributed_c10d. Found None.
Traceback (most recent call last)
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 198, in main
run_command_line(args)
File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
run_path(sys.argv[0], run_name='__main__')
File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/conda/lib/python3.10/runpy.py",
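The "Couldn't initialize SMDDP" message above suggests smdistributed.dataparallel tried to hook a private torch helper, `torch.distributed.distributed_c10d._check_for_nccl_backend`, and found `None` — which typically happens when the installed PyTorch is newer than the version the SMDDP build targets. A small diagnostic sketch (my own, not part of SMDDP) to check which case a given environment is in:

```python
# Hedged diagnostic: verify whether the private helper SMDDP patches
# still exists in the installed torch build. "missing" reproduces the
# precondition for the SMDDP init failure; "found" means the torch
# version in the container should be compatible with this check.
import importlib

try:
    c10d = importlib.import_module("torch.distributed.distributed_c10d")
    fn = getattr(c10d, "_check_for_nccl_backend", None)
    status = "found" if callable(fn) else "missing"
except ImportError:
    status = "torch not installed"

print(status)
```

Running this inside the training container (e.g. as a one-off entry point) shows whether a dependency silently replaced the preinstalled torch.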
DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
Current behavior: shown in the logs above
Expected behavior: Model should train on multiple GPUs.
Additional context:
I think this issue is related to PyTorch 2.2.0 or CUDA 12. One of my dependencies (xformers) was forcing the installation of the latest PyTorch version, which caused this issue. I hope this can be fixed.
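A possible workaround when a dependency upgrades torch at install time is to pin both packages in the job's `requirements.txt` so the container's preinstalled PyTorch 2.0.0 stays in place. The xformers version below is illustrative only — it should be replaced with whichever release is built against torch 2.0.0:

```
# requirements.txt for the SageMaker job (versions illustrative):
# pin torch to the DLC's preinstalled version so xformers cannot
# pull in a newer torch/CUDA stack that breaks SMDDP.
torch==2.0.0
xformers==0.0.19
```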