Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)
I am trying to get multi-node inference working with 4 nodes, each with 4xRTX8000 GPUs (48GB per GPU). I launch the script with:
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
The script finishes loading all the checkpoints and begins inference but then quickly runs into the following error:
...
gr061: loading checkpoint (68)
gr061: loading checkpoint (69)
gr061: loading checkpoint (70)
gr063: [2022-07-20 19:03:10,723] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: loading checkpoint (71)
gr061: [2022-07-20 19:03:11,443] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: *** Starting to generate 100 tokens with bs=1
gr061: Generate args {'max_new_tokens': 100, 'do_sample': False}
gr064: [2022-07-20 19:03:12,551] [INFO] [engine.py:144:__init__] Place model to device: 3
gr061: [2022-07-20 19:03:13,294] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:14,244] [INFO] [engine.py:144:__init__] Place model to device: 2
gr062: [2022-07-20 19:03:14,406] [INFO] [engine.py:144:__init__] Place model to device: 0
gr063: [2022-07-20 19:03:14,791] [INFO] [engine.py:144:__init__] Place model to device: 2
gr064: [2022-07-20 19:03:15,444] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,542] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,618] [INFO] [engine.py:144:__init__] Place model to device: 1
gr062: [2022-07-20 19:03:16,179] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:16,513] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: [2022-07-20 19:03:16,777] [INFO] [engine.py:144:__init__] Place model to device: 0
gr064: [2022-07-20 19:03:17,541] [INFO] [engine.py:144:__init__] Place model to device: 1
gr063: [2022-07-20 19:03:18,336] [INFO] [engine.py:144:__init__] Place model to device: 3
gr063: [2022-07-20 19:03:18,547] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: Traceback (most recent call last):
gr064: File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064: _ = generate()
gr064: File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064: outputs = model.generate(**input_tokens, **generate_kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064: Traceback (most recent call last):
gr064: File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064: _ = generate()
gr064: File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064: outputs = model.generate(**input_tokens, **generate_kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064: Traceback (most recent call last):
gr064: File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064: _ = generate()
gr064: File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064: outputs = model.generate(**input_tokens, **generate_kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064: return func(*args, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064: return func(*args, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064: return func(*args, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064: return func(*args, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064: outputs = self(
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: outputs = self(
gr064: outputs = self(
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: outputs = self(
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064: return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
gr064:
gr064: File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064: File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064: outputs = self.model_orig_fwd(*inputs, **kwargs)outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064: outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064: outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064: transformer_outputs = self.transformer(transformer_outputs = self.transformer(transformer_outputs = self.transformer(transformer_outputs = self.transformer(
gr064:
gr064:
gr064:
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: return forward_call(*input, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064: return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
gr064:
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064: return forward_call(*input, **kwargs)
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064: outputs = block(
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: outputs = block(
gr064: outputs = block(
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: outputs = block(
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064: return forward_call(*input, **kwargs)
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064: self.attention(input,
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: self.attention(input,
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: self.attention(input,
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: self.attention(input,
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064: return forward_call(*input, **kwargs)
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064: output = DeepSpeedSelfAttentionFunction.apply(
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064: output = DeepSpeedSelfAttentionFunction.apply(
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064: output = DeepSpeedSelfAttentionFunction.apply(
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064: output = DeepSpeedSelfAttentionFunction.apply(
gr064: File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064: dist.all_reduce(output, group=mp_group)
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064: dist.all_reduce(output, group=mp_group)
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064: dist.all_reduce(output, group=mp_group)
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064: dist.all_reduce(output, group=mp_group)
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064: return cdb.all_reduce(tensor, op, group, async_op)
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064: return cdb.all_reduce(tensor, op, group, async_op)
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064: return torch.distributed.all_reduce(tensor=tensor,
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064: return torch.distributed.all_reduce(tensor=tensor,
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064: return cdb.all_reduce(tensor, op, group, async_op)
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064: return torch.distributed.all_reduce(tensor=tensor,
gr064: return cdb.all_reduce(tensor, op, group, async_op) File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:
gr064: File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064: return torch.distributed.all_reduce(tensor=tensor,
gr064: File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064: work = group.allreduce([tensor], opts)
gr064: work = group.allreduce([tensor], opts)
gr064: work = group.allreduce([tensor], opts)
gr064: RuntimeErrorRuntimeError: : NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.RuntimeError
gr064: work = group.allreduce([tensor], opts)
gr064: :
gr064: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064: what(): CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fae5f70b477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fae8ccfc4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fae8cd02417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fae9f4f0c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fae5f6eed95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fae9f3e5b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fae9f719fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fae9f71a2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x55ccd72e1e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x55ccd72eead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x55ccd73027ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x55ccd72d6661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x55ccd72dc81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x55ccd73ceaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x55ccd73cdf56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x55ccd73c12b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x55ccd7393b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7faee4a9a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x55ccd7393a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064: what(): CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f183ee2a477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7f186c41b4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f186c421417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7f187ec0fc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f183ee0dd95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7f187eb04b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7f187ee38fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f187ee392c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5616533d4e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5616533e1ad8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5616533f57ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5616533c9661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5616533cf81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x5616534c1aec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x5616534c0f56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x5616534b42b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x561653486b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7f18c41b90b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x561653486a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064: what(): CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb213ab8477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fb2410a94a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fb2410af417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fb25389dc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fb213a9bd95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fb253792b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fb253ac6fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fb253ac72c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5616125aee28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5616125bbad8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5616125cf7ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5616125a3661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5616125a981a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x56161269baec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x56161269af56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x56161268e2b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x561612660b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7fb298e470b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x561612660a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064: what(): CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8724e9e477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7f875248f4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f8752495417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7f8764c83c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f8724e81d95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7f8764b78b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7f8764eacfc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f8764ead2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5640a0321e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5640a032ead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5640a03427ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5640a0316661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5640a031c81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x5640a040eaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x5640a040df56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x5640a04012b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x5640a03d3b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7f87aa22d0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x5640a03d3a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: [2022-07-20 19:03:32,219] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678791
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678792
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678793
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678794
gr064: [2022-07-20 19:03:32,220] [ERROR] [launch.py:184:sigkill_handler] ['/ext3/miniconda3/bin/python3.9', '-u', 'Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py', '--local_rank=3', '--name', 'bigscience/bloom'] exits with return code = -6
pdsh@gr061: gr064: ssh exited with exit code 250
pdsh@gr061: gr062: ssh exited with exit code 250
pdsh@gr061: gr061: ssh exited with exit code 250
I've tried with CUDA 10.2 and 11.6 and there's no difference.
@stas00
Yeah, I get that too when I try to use too large a batch size. But if you're running my script, its default is bs=1, so that shouldn't really be a problem. I haven't tried it on your setup, but the issue is on the DS-Inference side.
@RezaYazdaniAminabadi, as you can see both I and many others run into this issue - could we change the kernel code to be more defensive? It's always the same group.allreduce([tensor], opts) where it happens.
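As a rough sanity check on the memory side, the arithmetic below estimates the per-GPU weight footprint for the setup described at the top of the thread (4 nodes x 4 GPUs, 48GB each). It assumes 176B parameters stored in fp16 (2 bytes each) and sharded evenly across all GPUs. It ignores activations, the KV cache, and framework overhead, so it is only a lower bound, not a measurement.

```python
# Back-of-the-envelope estimate; all figures are the assumptions stated above.
NUM_PARAMS = 176e9        # BLOOM-176B parameter count (approximate)
BYTES_PER_PARAM = 2       # fp16 / torch.half
NUM_GPUS = 4 * 4          # 4 nodes x 4 GPUs
GPU_MEMORY_GB = 48        # per RTX8000


def weights_per_gpu_gb(num_params, bytes_per_param, num_gpus):
    """Weight memory per GPU in GB (1 GB = 1e9 bytes), assuming an even split."""
    return num_params * bytes_per_param / num_gpus / 1e9


per_gpu = weights_per_gpu_gb(NUM_PARAMS, BYTES_PER_PARAM, NUM_GPUS)
print(f"weights per GPU: {per_gpu:.1f} GB of {GPU_MEMORY_GB} GB")
```

Under these assumptions the weights alone take about 22 GB per GPU, which leaves headroom at bs=1 and is consistent with the failure not being a plain out-of-memory condition.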
Hi @stas00 ,
Thanks for tagging me here. I will definitely look into this and try to fix it soon.
Best, Reza
@asaparov, please run the following 2 experiments:
- same setup as yours, but add CUDA_LAUNCH_BLOCKING=1, as in:
CUDA_LAUNCH_BLOCKING=1 deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
and let's see if it starts working
- does it fail in the same way if you use "bigscience/bloom-1b3"? This is just to check that the issue is with the model size and not with the setup/system. But don't use CUDA_LAUNCH_BLOCKING=1 this time. That is:
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom-1b3
Thank you!
@stas00 It seems to be working with CUDA_LAUNCH_BLOCKING=1!
I'll test with bigscience/bloom-1b3 next.
Thank you for reporting back, @asaparov! You can use this workaround for now; it will be just a tad slower, until the underlying issue is resolved. The difficulty is in reproducing it.
@RezaYazdaniAminabadi, so @asaparov's success with CUDA_LAUNCH_BLOCKING=1 is pointing to some unsynchronized code in the kernels, as I proposed yesterday.
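For anyone who prefers to set the workaround from inside the script rather than on the command line, a minimal sketch follows. It assumes the variable is set before the first CUDA call is made (safest: before importing torch), and the helper name is hypothetical. CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so an asynchronous error surfaces at the call that caused it rather than at a later all_reduce.

```python
import os


def enable_blocking_launches(env=None):
    """Return a copy of env with CUDA_LAUNCH_BLOCKING=1 set (hypothetical helper)."""
    new_env = dict(os.environ if env is None else env)
    new_env["CUDA_LAUNCH_BLOCKING"] = "1"
    return new_env


# Apply to the current process; do this before torch initializes CUDA.
os.environ.update(enable_blocking_launches())
```

In a multi-node run every rank needs the variable set in its own process, so placing this at the top of the inference script is one way to avoid depending on how the launcher forwards the environment.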
@stas00 Actually I just tested both bigscience/bloom and bigscience/bloom-1b3 without CUDA_LAUNCH_BLOCKING=1 and they both work. This is probably because I pulled newer code from the bloom-inference branch of this repo (commit b76e516) and the code from the ds-inference/bloom-fix branch of DeepSpeed (commit f39c78f).
I had to fix a few bugs related to save_mp_checkpoint_path being set to False instead of None, but everything seems to work fine after that.
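The False-versus-None mix-up mentioned above is the kind of thing a small guard can catch. The sketch below is not the actual fix that was applied; it is a hypothetical helper that maps any non-path value (False, an empty string) to None before the option is passed on as save_mp_checkpoint_path.

```python
def normalize_checkpoint_path(value):
    """Return value if it is a non-empty string path, otherwise None."""
    if isinstance(value, str) and value.strip():
        return value
    return None


# Hypothetical usage: a stray False becomes None before reaching the kwargs.
kwargs = {}
kwargs["save_mp_checkpoint_path"] = normalize_checkpoint_path(False)
print(kwargs)
```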
I suspect that the bug is intermittent, as it pops up inconsistently in various situations. But if it works for you at the moment, that's great!
Yes, save_mp_checkpoint_path was just added and is still being fixed up.
It basically allows you to set the tp-sharded path and then it'll save the new checkpoint - and the load time from it will be 1-2min instead of 10-20min. You may want to give it a try.
Once the checkpoint is created you need to set parallelization="tp".
The 2 new changes are: first, the addition of save_mp_checkpoint_path to save the tp-sharded weights on init:
kwargs["save_mp_checkpoint_path"] = checkpoint_dir
#checkpoints_json=None
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.half,
                                 checkpoint=checkpoints_json,
                                 **kwargs,
                                 )
and the addition of parallelization in the checkpoint json format
checkpoint_type = "tp"
checkpoint_dir = "/home/nicolas_huggingface_co/src/Megatron-DeepSpeed/bloom-tp"
checkpoint_files = glob.glob(f"{checkpoint_dir}/*pt")
if len(checkpoint_files) == 0:
    # hf checkpoint
    checkpoint_files = get_checkpoint_files(model_name)
    checkpoint_type = "pp"  # normal hf hub checkpoint
if rank == 0:
    print("Checkpoint files:", checkpoint_files)
    print("Checkpoint type:", checkpoint_type)
checkpoints_json = "checkpoints.json"
def write_checkponts_json():
    with io.open(checkpoints_json, 'w', encoding='utf-8') as f:
        data = {
            "type": "BLOOM-176B",
            "checkpoints": checkpoint_files,
            "version": 1.0,
            "parallelization": checkpoint_type,
        }
The 2 values are pp (normal hf checkpoint) and tp (tp-sharded checkpoint).
I will make it all configurable once the dust settles.
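Putting the pieces above together, here is a self-contained sketch of the tp/pp selection and the JSON file it produces. The function names and the fallback_files argument are hypothetical stand-ins (the script itself calls get_checkpoint_files(model_name) for the hf case); the JSON keys are the ones shown above.

```python
import glob
import json
import os
import tempfile


def select_checkpoints(checkpoint_dir, fallback_files):
    """Prefer tp-sharded *pt files in checkpoint_dir; else fall back to hf files."""
    files = sorted(glob.glob(os.path.join(checkpoint_dir, "*pt")))
    if files:
        return files, "tp"
    return list(fallback_files), "pp"  # normal hf hub checkpoint


def write_checkpoints_json(path, checkpoint_files, checkpoint_type):
    """Write the checkpoint description using the keys shown above."""
    data = {
        "type": "BLOOM-176B",
        "checkpoints": checkpoint_files,
        "version": 1.0,
        "parallelization": checkpoint_type,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


# Usage with an empty directory, so the hf fallback ("pp") is chosen.
with tempfile.TemporaryDirectory() as tmp:
    files, kind = select_checkpoints(tmp, ["hf-shard-0.bin"])
    write_checkpoints_json(os.path.join(tmp, "checkpoints.json"), files, kind)
    print("Checkpoint type:", kind)
```

Once a run with save_mp_checkpoint_path has populated the directory with tp-sharded files, the same selection would return "tp" on the next start.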
Hi @asaparov
It's great to see your issue is solved. As @stas00 mentioned the part regarding the new checkpoint loading is coming soon too. @stas00, thanks for full details here :)
Best, Reza
@stas00 Actually I just tested both bigscience/bloom and bigscience/bloom-1b3 without CUDA_LAUNCH_BLOCKING=1 and they both work. This is probably because I pulled newer code from the bloom-inference branch of this repo (commit b76e516) and the code from the ds-inference/bloom-fix branch of DeepSpeed (commit f39c78f).
I had to fix a few bugs related to save_mp_checkpoint_path being set to False instead of None, but everything seems to work fine after that.
@asaparov Can you share your code for BLOOM inference, or tell me which inference repo you used and whether you made any code modifications? I have the same hardware setup as yours, but I can't get rid of the CUDA errors even after adding CUDA_LAUNCH_BLOCKING=1. I used the inference code on the bloom-inference branch and the DeepSpeed branch ds-inference/bloom-fix. Also, did you set the environment variable WORLD_SIZE?
@pai4451 I didn't change any code from this repo at all. I followed the installation instructions in the readme. I invoke the inference script using:
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
I'm running everything in a conda environment in a singularity container. The output of conda list is:
Singularity> conda list
# packages in environment at /ext3/miniconda3:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_kmp_llvm conda-forge
absl-py 1.2.0 pypi_0 pypi
aiohttp 3.8.1 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
apex 0.1 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 21.4.0 pypi_0 pypi
black 21.4b0 pypi_0 pypi
blas 2.115 mkl conda-forge
blas-devel 3.9.0 15_linux64_mkl conda-forge
brotlipy 0.7.0 py39hb9d737c_1004 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2022.6.15 ha878542_0 conda-forge
cachetools 5.2.0 pypi_0 pypi
certifi 2022.6.15 py39hf3d152e_0 conda-forge
cffi 1.15.1 py39he91dace_0 conda-forge
charset-normalizer 2.1.0 pyhd8ed1ab_0 conda-forge
click 8.1.3 pypi_0 pypi
colorama 0.4.5 pyhd8ed1ab_0 conda-forge
conda 4.13.0 py39hf3d152e_1 conda-forge
conda-package-handling 1.8.1 py39hb9d737c_1 conda-forge
cryptography 37.0.4 py39hd97740a_0 conda-forge
cudatoolkit 11.6.0 hecad31d_10 conda-forge
datasets 2.4.0 pypi_0 pypi
deepspeed 0.7.0+f39c78f9 dev_0 <develop>
dill 0.3.5.1 pypi_0 pypi
filelock 3.7.1 pypi_0 pypi
frozenlist 1.3.0 pypi_0 pypi
fsspec 2022.5.0 pypi_0 pypi
google-auth 2.9.1 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
grpcio 1.47.0 pypi_0 pypi
hjson 3.0.2 pypi_0 pypi
huggingface-hub 0.8.1 pypi_0 pypi
idna 3.3 pyhd8ed1ab_0 conda-forge
importlib-metadata 4.12.0 pypi_0 pypi
isort 5.10.1 pypi_0 pypi
joblib 1.1.0 pypi_0 pypi
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libaio 0.3.113 h5eee18b_0 <unknown>
libblas 3.9.0 15_linux64_mkl conda-forge
libcblas 3.9.0 15_linux64_mkl conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 12.1.0 h8d9b700_16 conda-forge
libgfortran-ng 12.1.0 h69a702a_16 conda-forge
libgfortran5 12.1.0 hdcd56e2_16 conda-forge
libgomp 12.1.0 h8d9b700_16 conda-forge
liblapack 3.9.0 15_linux64_mkl conda-forge
liblapacke 3.9.0 15_linux64_mkl conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libstdcxx-ng 12.1.0 ha89aaad_16 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libzlib 1.2.12 h166bdaf_2 conda-forge
llvm-openmp 14.0.4 he0ac6c6_0 conda-forge
markdown 3.4.1 pypi_0 pypi
markupsafe 2.1.1 pypi_0 pypi
mkl 2022.1.0 h84fe81f_915 conda-forge
mkl-devel 2022.1.0 ha770c72_916 conda-forge
mkl-include 2022.1.0 h84fe81f_915 conda-forge
multidict 6.0.2 pypi_0 pypi
multiprocess 0.70.13 pypi_0 pypi
mypy-extensions 0.4.3 pypi_0 pypi
ncurses 6.3 h27087fc_1 conda-forge
ninja 1.10.2.3 pypi_0 pypi
nltk 3.7 pypi_0 pypi
numpy 1.23.1 pypi_0 pypi
oauthlib 3.2.0 pypi_0 pypi
openssl 1.1.1q h166bdaf_0 conda-forge
packaging 21.3 pypi_0 pypi
pandas 1.4.3 pypi_0 pypi
parameterized 0.8.1 pypi_0 pypi
pathspec 0.9.0 pypi_0 pypi
pip 22.2 pyhd8ed1ab_0 conda-forge
protobuf 3.19.4 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
py-cpuinfo 8.0.0 pypi_0 pypi
pyarrow 8.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pybind11 2.10.0 pypi_0 pypi
pycosat 0.6.3 py39hb9d737c_1010 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pydantic 1.9.1 pypi_0 pypi
pyopenssl 22.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.9 pypi_0 pypi
pysocks 1.7.1 py39hf3d152e_5 conda-forge
python 3.9.13 h9a8a25e_0_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python_abi 3.9 2_cp39 conda-forge
pytorch 1.12.0 py3.9_cuda11.6_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2022.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.1.2 h0f457ee_0 conda-forge
regex 2022.7.25 pypi_0 pypi
requests 2.28.1 pyhd8ed1ab_0 conda-forge
requests-oauthlib 1.3.1 pypi_0 pypi
responses 0.18.0 pypi_0 pypi
rsa 4.9 pypi_0 pypi
ruamel_yaml 0.15.80 py39hb9d737c_1007 conda-forge
setuptools 63.2.0 py39hf3d152e_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sqlite 3.39.2 h4ff8645_0 conda-forge
tbb 2021.5.0 h924138e_1 conda-forge
tensorboard 2.9.1 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tk 8.6.12 h27826a3_0 conda-forge
tokenizers 0.12.1 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
tqdm 4.64.0 pyhd8ed1ab_0 conda-forge
transformers 4.20.1 pypi_0 pypi
typing_extensions 4.3.0 pyha770c72_0 conda-forge
tzdata 2022a h191b570_0 conda-forge
urllib3 1.26.11 pyhd8ed1ab_0 conda-forge
werkzeug 2.2.0 pypi_0 pypi
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
xxhash 3.0.0 pypi_0 pypi
xz 5.2.5 h516909a_1 conda-forge
yaml 0.2.5 h7f98852_2 conda-forge
yarl 1.7.2 pypi_0 pypi
zipp 3.8.1 pypi_0 pypi
zlib 1.2.12 h166bdaf_2 conda-forge
For this repo and deepspeed, I'm using the commits that I mention above. I had a few errors from deepspeed complaining about save_mp_checkpoint_path which I fixed with the following changes:
diff --git a/deepspeed/__init__.py b/deepspeed/__init__.py
index 655d7a96..50049a2a 100755
--- a/deepspeed/__init__.py
+++ b/deepspeed/__init__.py
@@ -239,7 +239,7 @@ def init_inference(model,
moe_type='standard',
args=None,
enable_cuda_graph=False,
- save_mp_checkpoint_path=False):
+ save_mp_checkpoint_path=None):
"""Initialize the DeepSpeed InferenceEngine.
Arguments:
diff --git a/deepspeed/inference/engine.py b/deepspeed/inference/engine.py
index b5841dab..f380cd21 100755
--- a/deepspeed/inference/engine.py
+++ b/deepspeed/inference/engine.py
@@ -50,7 +50,7 @@ class InferenceEngine(Module):
moe_type='standard',
config=None,
enable_cuda_graph=False,
- save_mp_checkpoint_path=False):
+ save_mp_checkpoint_path=None):
"""
Args:
model: torch.nn.Module
@@ -322,7 +322,7 @@ class InferenceEngine(Module):
moe_type='standard',
training_mp_size=1,
checkpoint_dir=None,
- save_mp_checkpoint_path=False):
+ save_mp_checkpoint_path=None):
checkpoint, ckpt_type = SDLoaderFactory.get_sd_loader_json(
checkpoint_dir) if checkpoint_dir is not None else (None, None)
replace_transformer_layer(client_module,
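One plausible reason the default value matters, assuming the downstream code guards with an `is not None` check (an assumption; the diff above does not show the check): False is not None, so a False default would pass such a guard and be treated as if a path had been given. The function should_save_checkpoint below is hypothetical and is not DeepSpeed code.

```python
def should_save_checkpoint(save_mp_checkpoint_path):
    # Hypothetical guard: anything other than None counts as "a path was given".
    return save_mp_checkpoint_path is not None


print(should_save_checkpoint(None))             # False: nothing to save
print(should_save_checkpoint(False))            # True: False slips past the guard
print(should_save_checkpoint("/tmp/bloom-tp"))  # True: a real path
```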
I also had to make a few other edits to deepspeed since I wanted each worker to run within the singularity container, and to prevent ssh from complaining about host key authentication (I'm running this on a cluster).
@asaparov Thanks for the details. I can finally run BLOOM inference with DeepSpeed on multiple nodes now. However, it only works for batch_size=1; when I increase the batch size, the error RuntimeError: CUDA error: an illegal memory access was encountered is thrown again. Do you have the same issue, or can you run inference with a batch size larger than 1 on your side? Thank you.
Hmm, it's not working for me even within a single node with batch size = 1, on 8x A100 80GB. Same CUDA illegal memory access error.
Check whether "NCCL WARN Call to ibv_reg_mr failed" appears in your log. In our case, we modified /etc/security/limits.conf to resolve it. You can find details here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
But it also only works for batch size == 1.
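The NCCL troubleshooting guide linked above ties this kind of registration failure to the locked-memory (memlock) user limit, which is what the /etc/security/limits.conf change adjusts. A minimal sketch for checking the limit seen by the current process on Linux, using only the Python standard library (the resource module is Unix-only); the helper name memlock_is_unlimited is hypothetical:

```python
import resource


def memlock_is_unlimited():
    """Return True if both the soft and hard RLIMIT_MEMLOCK are unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    # Limits are reported in bytes; RLIM_INFINITY means no limit.
    print("memlock soft/hard:", soft, hard)
    print("unlimited:", memlock_is_unlimited())
```

Running this inside the same container and launcher environment as the workers matters, since the limit seen by an interactive shell can differ from the one the worker processes inherit.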
@pohunghuang-nctu Nothing like that in my logs. This is the full trace:
[2022-07-26 11:41:08,472] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-26 11:41:11,508] [INFO] [runner.py:457:main] cmd = /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --benchmark
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_IB_DISABLE=1
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_DEBUG=INFO
[2022-07-26 11:41:12,431] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-07-26 11:41:12,431] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-07-26 11:41:12,431] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-07-26 11:41:12,431] [INFO] [launch.py:123:main] dist_world_size=8
[2022-07-26 11:41:12,431] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2022-07-26 11:41:13,715] [INFO] [comm.py:423:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom
[2022-07-26 11:41:22,608] [INFO] [utils.py:827:see_memory_usage] pre-from-pretrained
[2022-07-26 11:41:22,608] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:22,608] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 11.2 GB, percent = 0.9%
[2022-07-26 11:41:22,745] [INFO] [utils.py:827:see_memory_usage] post-from-pretrained
[2022-07-26 11:41:22,746] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:22,746] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 11.21 GB, percent = 0.9%
[2022-07-26 11:41:22,795] [INFO] [utils.py:827:see_memory_usage] post-init-ds-zero-init
[2022-07-26 11:41:22,795] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:22,796] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 11.27 GB, percent = 0.9%
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Using network Socket
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Using network Socket
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Using network Socket
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Using network Socket
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all rings
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all rings
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all rings
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all rings
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all trees
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all trees
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all trees
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all trees
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all trees
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all trees
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all trees
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all trees
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO comm 0x7f6890002fb0 rank 1 nranks 8 cudaDev 1 busId 4080 - Init COMPLETE
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO comm 0x7fbcc4002fb0 rank 4 nranks 8 cudaDev 4 busId 40b0 - Init COMPLETE
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO comm 0x7f0b9c002fb0 rank 2 nranks 8 cudaDev 2 busId 4090 - Init COMPLETE
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO comm 0x7f09a0002fb0 rank 6 nranks 8 cudaDev 6 busId 40d0 - Init COMPLETE
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO comm 0x7f61d0002fb0 rank 3 nranks 8 cudaDev 3 busId 40a0 - Init COMPLETE
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO comm 0x7fbd04002fb0 rank 0 nranks 8 cudaDev 0 busId 4070 - Init COMPLETE
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Launch mode Parallel
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO comm 0x7f03dc002fb0 rank 5 nranks 8 cudaDev 5 busId 40c0 - Init COMPLETE
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO comm 0x7f1000002fb0 rank 7 nranks 8 cudaDev 7 busId 40e0 - Init COMPLETE
[2022-07-26 11:41:29,495] [INFO] [utils.py:827:see_memory_usage] pre-ds-inference-init
[2022-07-26 11:41:29,495] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:29,496] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 19.92 GB, percent = 1.6%
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.7.0+b6305d0e, git-hash=b6305d0e, git-branch=master
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25245213508605957 seconds
[2022-07-26 11:41:30,151] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True}
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2497098445892334 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2436366081237793 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24797964096069336 seconds
Time to load transformer_inference op: 0.24489784240722656 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2467021942138672 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24748826026916504 seconds
Time to load transformer_inference op: 0.24941658973693848 seconds
Loading 72 checkpoint shards: 0%| | 0/72 [11:08<?, ?it/s]
[2022-07-26 11:52:39,789] [INFO] [engine.py:145:__init__] Place model to device: 6
Loading 72 checkpoint shards: 0%| | 0/72 [11:09<?, ?it/s]
[2022-07-26 11:52:39,989] [INFO] [engine.py:145:__init__] Place model to device: 1
Loading 72 checkpoint shards: 0%| | 0/72 [11:10<?, ?it/s]
[2022-07-26 11:52:41,127] [INFO] [engine.py:145:__init__] Place model to device: 3
Loading 72 checkpoint shards: 0%| | 0/72 [11:14<?, ?it/s]
[2022-07-26 11:52:45,432] [INFO] [engine.py:145:__init__] Place model to device: 5
Loading 72 checkpoint shards: 0%| | 0/72 [11:22<?, ?it/s]
[2022-07-26 11:52:53,353] [INFO] [engine.py:145:__init__] Place model to device: 7
Loading 72 checkpoint shards: 0%| | 0/72 [11:24<?, ?it/s]
[2022-07-26 11:52:55,107] [INFO] [engine.py:145:__init__] Place model to device: 2
Loading 72 checkpoint shards: 100%|██████████| 72/72 [11:24<00:00, 9.51s/it]
[2022-07-26 11:52:55,582] [INFO] [engine.py:145:__init__] Place model to device: 0
[2022-07-26 11:52:55,707] [INFO] [utils.py:827:see_memory_usage] post-ds-inference-init
[2022-07-26 11:52:55,708] [INFO] [utils.py:828:see_memory_usage] MA 47.04 GB Max_MA 47.24 GB CA 47.04 GB Max_CA 47 GB
[2022-07-26 11:52:55,709] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 25.77 GB, percent = 2.0%
*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': False}
Loading 72 checkpoint shards: 0%| | 0/72 [11:25<?, ?it/s]
[2022-07-26 11:52:56,613] [INFO] [engine.py:145:__init__] Place model to device: 4
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
llm-test-cluster-9:1281342:1283501 [1] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281344:1283502 [3] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281343:1283503 [2] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281347:1283504 [6] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281346:1283505 [5] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281348:1283506 [7] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281345:1283507 [4] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281341:1283500 [0] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
I get the same error for batch size > 1, even with CUDA_LAUNCH_BLOCKING=1:
gr062: RuntimeError: CUDA error: an illegal memory access was encountered
gr062: terminate called after throwing an instance of 'c10::CUDAError'
gr062: what(): CUDA error: an illegal memory access was encountered
gr062: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr062: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7ad7777477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #1: <unknown function> + 0x1d4a3 (0x7f7b04d684a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f7b04d6e417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #3: <unknown function> + 0x458c68 (0x7f7b1755cc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f7ad775ad95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #5: <unknown function> + 0x34db35 (0x7f7b17451b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #6: <unknown function> + 0x681fc8 (0x7f7b17785fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f7b177862c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #8: <unknown function> + 0x127e28 (0x55bbd032ae28 in /ext3/miniconda3/bin/python3.9)
gr062: frame #9: <unknown function> + 0x134ad8 (0x55bbd0337ad8 in /ext3/miniconda3/bin/python3.9)
gr062: frame #10: <unknown function> + 0x1487ce (0x55bbd034b7ce in /ext3/miniconda3/bin/python3.9)
gr062: frame #11: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #12: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #13: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #14: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #15: <unknown function> + 0x11c661 (0x55bbd031f661 in /ext3/miniconda3/bin/python3.9)
gr062: frame #16: PyDict_SetItemString + 0x4a (0x55bbd032581a in /ext3/miniconda3/bin/python3.9)
gr062: frame #17: <unknown function> + 0x214aec (0x55bbd0417aec in /ext3/miniconda3/bin/python3.9)
gr062: frame #18: Py_FinalizeEx + 0x186 (0x55bbd0416f56 in /ext3/miniconda3/bin/python3.9)
gr062: frame #19: Py_RunMain + 0x112 (0x55bbd040a2b2 in /ext3/miniconda3/bin/python3.9)
gr062: frame #20: Py_BytesMain + 0x39 (0x55bbd03dcb79 in /ext3/miniconda3/bin/python3.9)
gr062: frame #21: __libc_start_main + 0xf3 (0x7f7b5cb060b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr062: frame #22: <unknown function> + 0x1d9a81 (0x55bbd03dca81 in /ext3/miniconda3/bin/python3.9)
@stas00 @RezaYazdaniAminabadi
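For anyone reproducing this, the debug variables have to reach every rank on every node, not just the shell that starts the launcher. Below is a minimal sketch of collecting them in one place. `CUDA_LAUNCH_BLOCKING=1` is the setting mentioned above; `NCCL_DEBUG=INFO` is an assumption on my part (it is the standard NCCL switch that produces `NCCL INFO` / `NCCL WARN` lines like those in the log). The helper names are hypothetical. The DeepSpeed launcher documents a `.deepspeed_env` file of `NAME=VALUE` lines for forwarding variables to remote nodes; please check the docs for your version before relying on that.

```python
import os

# Debug settings: the first is from this thread, the second is an assumed addition.
DEBUG_ENV = {
    "CUDA_LAUNCH_BLOCKING": "1",  # run kernels synchronously so errors surface at the failing call
    "NCCL_DEBUG": "INFO",         # ask NCCL to log INFO/WARN lines
}


def render_env_lines(env):
    """Return NAME=VALUE lines, sorted by name for stable output."""
    return [f"{name}={value}" for name, value in sorted(env.items())]


def apply_to_current_process(env):
    """Set the variables for this process; do this before CUDA is initialised."""
    os.environ.update(env)


if __name__ == "__main__":
    apply_to_current_process(DEBUG_ENV)
    # These lines could be written to an env file that the launcher forwards.
    print("\n".join(render_env_lines(DEBUG_ENV)))
```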
@asaparov Okay, at least this is reproducible, thanks.
I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?
What are your CUDA and DeepSpeed versions? I have CUDA 11.5 and DeepSpeed 0.7.0 installed from the ds-inference/bloom-fix branch, and I can run BLOOM inference with batch size 1 on two nodes.
I am using CUDA 11.6, and DeepSpeed is built from master.
@mayank31398 Perhaps try the ds-inference/bloom-fix branch of deepspeed?
I'll try this today. Thanks.
Actually, I just tried running with larger batch sizes (16 and 32) and it doesn't run into the "CUDA illegal memory access" error (as I did with batch size=2). Maybe it is intermittent? Or maybe something's wrong with batch size 2 specifically.
We (with @pai4451) tried batch sizes from 8 down to 2, and all of them failed, but we have not yet tried batch sizes > 8. Pai will test that today to see what happens on our side.
@asaparov I tried the inference script with batch sizes 1, 2, 4, 8, 16, 32, 64 and 128. Only batch sizes 1 and 32 work, which is a bit surprising. Anyway, we'll have to wait for someone to fix the issue in this repo.
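Since the failures look intermittent, it may help to record the sweep in a repeatable way. Here is a minimal sketch of such a harness, not the script I actually ran. `run_once` is a hypothetical callable that performs one generation run for a given batch size and raises `RuntimeError` on failure. In practice it should start a fresh job per batch size, because a process that has hit a CUDA illegal memory access cannot keep using its CUDA context. The stand-in below only mimics the outcome reported above so that the sketch runs without GPUs.

```python
def sweep_batch_sizes(run_once, batch_sizes):
    """Call run_once(bs) for each batch size and record whether it raised."""
    results = {}
    for bs in batch_sizes:
        try:
            run_once(bs)
            results[bs] = "ok"
        except RuntimeError as err:
            results[bs] = f"failed: {err}"
    return results


def fake_run(bs):
    """Stand-in for a real launch; mimics the outcome reported in this thread."""
    if bs not in (1, 32):
        raise RuntimeError("CUDA error: an illegal memory access was encountered")


if __name__ == "__main__":
    outcome = sweep_batch_sizes(fake_run, [1, 2, 4, 8, 16, 32, 64, 128])
    for bs, status in outcome.items():
        print(f"bs={bs}: {status}")
```

Running the same sweep several times and comparing the results would show whether a given batch size fails consistently or only some of the time.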
Hi all,
There are some new changes merged into DeepSpeed master. Would you mind trying them? I have tried batch sizes 1 and 128, and both work on my side (I ran on 8 A100 80GB GPUs). I will try on A100-40G as well to make sure all is fine. Also, you can now generate MP-sharded checkpoints to load the model much faster. You can find more information in this PR: https://github.com/microsoft/DeepSpeed/pull/2132
Thanks, Reza
@RezaYazdaniAminabadi could you give a pointer (e.g. where to find the docs) on how to "generate MP-sharded checkpoints"? So far we only have the 70 .bin files downloaded from Hugging Face. Do you mean there is a tool that re-formats these 70 files into world-size pieces to speed up model loading? Thanks in advance.
Hi @pohunghuang-nctu
Sure, you need to pass save_mp_checkpoint_path to the init_inference method in order to save the tp-sharded checkpoints in the path you specified. You will see that after loading the checkpoint, DeepSpeed starts saving the new checkpoints, and you will eventually have the tp-sharded checkpoints. In addition, there will be a json config file saved in that path (like bloom_ds-inference-config.json) that you can pass as the checkpoint argument to init_inference in the next run. Note that you can remove save_mp_checkpoint_path after you save the tp-sharded checkpoints for the first time, so that DeepSpeed doesn't always save a new checkpoint for you.
Best, Reza
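The two-phase flow described above can be sketched as a small helper that picks the checkpoint-related keyword arguments for `init_inference`. This is only an illustration: the helper name and the `original_checkpoint` and `shard_dir` parameters are hypothetical, and the default config file name is just the example name mentioned above, so check what DeepSpeed actually writes into your directory. Only `save_mp_checkpoint_path` and `checkpoint` come from the description.

```python
import os


def checkpoint_kwargs(original_checkpoint, shard_dir,
                      config_name="bloom_ds-inference-config.json"):
    """Choose checkpoint-related kwargs for deepspeed.init_inference.

    First run: load the original checkpoint and ask DeepSpeed to save
    tp-sharded checkpoints into shard_dir.
    Later runs: load from the JSON config saved in shard_dir and drop
    save_mp_checkpoint_path so the shards are not written again.
    """
    config_path = os.path.join(shard_dir, config_name)
    if os.path.isfile(config_path):
        # Shards already exist: point `checkpoint` at the saved config.
        return {"checkpoint": config_path}
    # No shards yet: keep the original checkpoint and request saving.
    return {
        "checkpoint": original_checkpoint,
        "save_mp_checkpoint_path": shard_dir,
    }


# Hypothetical usage; the other init_inference arguments stay as in your script:
# model = deepspeed.init_inference(model, **other_args,
#                                  **checkpoint_kwargs(ckpt_json, "/path/to/shards"))
```

Keeping the decision in one place means the same launch command works for both the first run and later runs.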
@RezaYazdaniAminabadi I was testing with the newly merged code last night but still hit the illegal memory accesses intermittently on the larger batch sizes. It wasn't like throwing a dice though, it would work for like a half hour and then stop working for another block of time and then start working again.
For the first time I was able to use some larger batch sizes though (at least part of the time), so something seems to have improved.
EDIT: these tests were on 8x A100 80GB
I am glad you could run it with large batch now! :) I think this might be related to some cache allocation issues. We are working on optimizing that part too.