Megatron-DeepSpeed Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all

I am trying to get multi-node inference working with 4 nodes, each with 4xRTX8000 GPUs (48GB per GPU). I launch the script with:

deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
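For reference, a DeepSpeed hostfile lists one host per line followed by its GPU count as slots=N. A minimal sketch that generates one for this setup, assuming the node names gr061 to gr064 that appear as prefixes in the log and 4 GPUs per node (the helper name make_hostfile is hypothetical):

```python
def make_hostfile(hosts, slots_per_host):
    """Return DeepSpeed hostfile text: one '<hostname> slots=<n>' line per node."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)

# Node names are taken from the log prefixes; 4 GPUs per node as described above.
hostfile_text = make_hostfile(["gr061", "gr062", "gr063", "gr064"], 4)
print(hostfile_text, end="")
```

Saving that text to a file and passing its path via --hostfile is what $hostfile stands for in the command.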

The script finishes loading all the checkpoints and begins inference but then quickly runs into the following error:

...
gr061: loading checkpoint (68)
gr061: loading checkpoint (69)
gr061: loading checkpoint (70)
gr063: [2022-07-20 19:03:10,723] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: loading checkpoint (71)
gr061: [2022-07-20 19:03:11,443] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: *** Starting to generate 100 tokens with bs=1
gr061: Generate args {'max_new_tokens': 100, 'do_sample': False}
gr064: [2022-07-20 19:03:12,551] [INFO] [engine.py:144:__init__] Place model to device: 3
gr061: [2022-07-20 19:03:13,294] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:14,244] [INFO] [engine.py:144:__init__] Place model to device: 2
gr062: [2022-07-20 19:03:14,406] [INFO] [engine.py:144:__init__] Place model to device: 0
gr063: [2022-07-20 19:03:14,791] [INFO] [engine.py:144:__init__] Place model to device: 2
gr064: [2022-07-20 19:03:15,444] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,542] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,618] [INFO] [engine.py:144:__init__] Place model to device: 1
gr062: [2022-07-20 19:03:16,179] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:16,513] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: [2022-07-20 19:03:16,777] [INFO] [engine.py:144:__init__] Place model to device: 0
gr064: [2022-07-20 19:03:17,541] [INFO] [engine.py:144:__init__] Place model to device: 1
gr063: [2022-07-20 19:03:18,336] [INFO] [engine.py:144:__init__] Place model to device: 3
gr063: [2022-07-20 19:03:18,547] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064:     return func(*args, **kwargs)
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064: return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:     return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:     return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:     outputs = self(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     outputs = self(
gr064:     outputs = self(
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: outputs = self(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:       File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:     return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
gr064:
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:         outputs = self.model_orig_fwd(*inputs, **kwargs)outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:     outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:     outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:                 transformer_outputs = self.transformer(transformer_outputs = self.transformer(transformer_outputs = self.transformer(transformer_outputs = self.transformer(
gr064:
gr064:
gr064:
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:         return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
gr064:
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064: return forward_call(*input, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:     outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     outputs = block(
gr064: outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     self.attention(input,
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:     return cdb.all_reduce(tensor, op, group, async_op)  File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     work = group.allreduce([tensor], opts)
gr064: work = group.allreduce([tensor], opts)
gr064:     work = group.allreduce([tensor], opts)
gr064: RuntimeErrorRuntimeError: :     NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.RuntimeError
gr064: work = group.allreduce([tensor], opts)
gr064: :
gr064: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fae5f70b477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fae8ccfc4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fae8cd02417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fae9f4f0c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fae5f6eed95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fae9f3e5b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fae9f719fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fae9f71a2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x55ccd72e1e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x55ccd72eead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x55ccd73027ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x55ccd72d6661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x55ccd72dc81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x55ccd73ceaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x55ccd73cdf56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x55ccd73c12b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x55ccd7393b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7faee4a9a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x55ccd7393a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f183ee2a477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7f186c41b4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f186c421417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7f187ec0fc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f183ee0dd95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7f187eb04b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7f187ee38fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f187ee392c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5616533d4e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5616533e1ad8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5616533f57ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5616533c9661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5616533cf81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x5616534c1aec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x5616534c0f56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x5616534b42b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x561653486b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7f18c41b90b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x561653486a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb213ab8477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fb2410a94a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fb2410af417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fb25389dc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fb213a9bd95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fb253792b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fb253ac6fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fb253ac72c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5616125aee28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5616125bbad8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5616125cf7ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5616125a3661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5616125a981a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x56161269baec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x56161269af56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x56161268e2b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x561612660b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7fb298e470b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x561612660a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8724e9e477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7f875248f4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f8752495417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7f8764c83c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f8724e81d95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7f8764b78b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7f8764eacfc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f8764ead2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5640a0321e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5640a032ead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5640a03427ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5640a0316661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5640a031c81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x5640a040eaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x5640a040df56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x5640a04012b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x5640a03d3b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7f87aa22d0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x5640a03d3a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: [2022-07-20 19:03:32,219] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678791
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678792
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678793
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678794
gr064: [2022-07-20 19:03:32,220] [ERROR] [launch.py:184:sigkill_handler] ['/ext3/miniconda3/bin/python3.9', '-u', 'Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py', '--local_rank=3', '--name', 'bigscience/bloom'] exits with return code = -6
pdsh@gr061: gr064: ssh exited with exit code 250
pdsh@gr061: gr062: ssh exited with exit code 250
pdsh@gr061: gr061: ssh exited with exit code 250

I've tried with CUDA 10.2 and 11.6 and there's no difference.

Jul 21 '22 05:07 asaparov

@stas00

Jul 23 '22 22:07 asaparov

Yeah, I get that too when I use too large a batch size. But if you're running my script, its default is bs=1, so that shouldn't really be a problem. I haven't tried it on your setup, but the issue is on the DS-Inference side.

@RezaYazdaniAminabadi, as you can see both I and many others run into this issue - could we change the kernel code to be more defensive? It's always the same group.allreduce([tensor], opts) where it happens.

Jul 24 '22 05:07 stas00

Hi @stas00 ,

Thanks for tagging me here. I will definitely look into this and try to fix it soon.

Best, Reza

Jul 24 '22 07:07 RezaYazdaniAminabadi

@asaparov, please run the following 2 experiments:

1. Same setup as yours, but add CUDA_LAUNCH_BLOCKING=1, as in:

CUDA_LAUNCH_BLOCKING=1 deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom

and let's see if it starts working.

2. Does it fail in the same way if you use "bigscience/bloom-1b3"? This is just to check that the issue is with the model size and not the setup/system. Don't use CUDA_LAUNCH_BLOCKING=1 this time. That is:

deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom-1b3
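If it is unclear whether a variable set on the launch command line reaches every rank on every node, one option is to set it at the top of the inference script, before anything initializes CUDA. A minimal sketch (the helper name enable_launch_blocking is hypothetical and not part of the script):

```python
import os

def enable_launch_blocking(env=os.environ):
    """Make CUDA kernel launches synchronous so errors surface at the failing call.

    Must run before the first CUDA call in the process (ideally before importing torch).
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]

enable_launch_blocking()
```

This is only a debugging aid: synchronous launches slow execution down, so it should be removed once the failing call has been identified.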

Thank you!

Jul 25 '22 23:07 stas00

@stas00 It seems to be working with CUDA_LAUNCH_BLOCKING=1!

I'll test with bigscience/bloom-1b3 next.

Jul 26 '22 16:07 asaparov

Thank you for reporting back, @asaparov! You can use this workaround for now, until the underlying issue is resolved; it will be just a tad slower. The difficulty is in reproducing it.

@RezaYazdaniAminabadi, @asaparov's success with CUDA_LAUNCH_BLOCKING=1 points to some unsynchronized code in the kernels, as I proposed yesterday.

Jul 26 '22 16:07 stas00

@stas00 Actually I just tested both bigscience/bloom and bigscience/bloom-1b3 without CUDA_LAUNCH_BLOCKING=1 and they both work. This is probably because I pulled newer code from the bloom-inference branch of this repo (commit b76e516) and the code from the ds-inference/bloom-fix branch of DeepSpeed (commit f39c78f).

I had to fix a few bugs related to save_mp_checkpoint_path being set to False instead of None, but everything seems to work fine after that.
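The exact fix isn't shown in this thread. As a hedged illustration of the False-versus-None pitfall, a hypothetical helper can normalize the value before it is passed on as a keyword argument, so that "not set" never arrives as False:

```python
def build_init_kwargs(save_mp_checkpoint_path=None):
    """Build extra kwargs, omitting the path unless it is a non-empty string."""
    kwargs = {}
    # Treat None, False, and "" all as "do not save a tp-sharded checkpoint".
    if isinstance(save_mp_checkpoint_path, str) and save_mp_checkpoint_path:
        kwargs["save_mp_checkpoint_path"] = save_mp_checkpoint_path
    return kwargs

print(build_init_kwargs(False))            # {}
print(build_init_kwargs("/tmp/bloom-tp"))  # {'save_mp_checkpoint_path': '/tmp/bloom-tp'}
```

The path /tmp/bloom-tp is a placeholder; the helper name and the choice to omit the key entirely are assumptions, not the change that was actually made in the branch.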

Jul 26 '22 20:07 asaparov

I suspect that the bug is intermittent, as it pops up inconsistently in various situations. But if it works for you at the moment, that's great!

Yes, save_mp_checkpoint_path was just added and is still being fixed up.

It basically allows you to set a path for the tp-sharded checkpoint, and it will then save the new checkpoint there. The load time from it will be 1-2min instead of 10-20min. You may want to give it a try.

Once the checkpoint is created, you need to set parallelization="tp".

There are 2 new changes: the addition of save_mp_checkpoint_path to save the tp-sharded weights on init:

kwargs["save_mp_checkpoint_path"] = checkpoint_dir

#checkpoints_json=None
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.half,
                                 checkpoint=checkpoints_json,
                                 **kwargs,
                                 )

and the addition of parallelization to the checkpoint json format:

checkpoint_type = "tp"
checkpoint_dir = "/home/nicolas_huggingface_co/src/Megatron-DeepSpeed/bloom-tp"

checkpoint_files = glob.glob(f"{checkpoint_dir}/*pt")
if len(checkpoint_files) == 0:
    # hf checkpoint
    checkpoint_files = get_checkpoint_files(model_name)
    checkpoint_type = "pp" # normal hf hub checkpoint

if rank == 0:
    print("Checkpoint files:", checkpoint_files)
    print("Checkpoint type:", checkpoint_type)

checkpoints_json = "checkpoints.json"
def write_checkponts_json():
    with io.open(checkpoints_json, 'w', encoding='utf-8') as f:
        data = {
            "type": "BLOOM-176B",
            "checkpoints": checkpoint_files,
            "version": 1.0,
            "parallelization": checkpoint_type,
        }
        # write the descriptor to disk (needs `import io, json` at the top of the script)
        json.dump(data, f)

The 2 possible values are pp (normal hf checkpoint) and tp (tp-sharded checkpoint).
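Put together, choosing between the two checkpoint types and writing the json descriptor could look like the following self-contained sketch. The field names come from the snippet above; the function name, its arguments, and the file names are assumptions for illustration, and the hf_files list stands in for whatever get_checkpoint_files(model_name) returns:

```python
import glob
import json
import os
import tempfile

def write_checkpoints_json(json_path, tp_dir, hf_files):
    """Write the checkpoint descriptor, preferring tp-sharded *pt files if present."""
    files = sorted(glob.glob(os.path.join(tp_dir, "*pt")))
    ckpt_type = "tp"
    if not files:
        # No tp-sharded checkpoint yet: fall back to the normal hf checkpoint files.
        files, ckpt_type = list(hf_files), "pp"
    data = {
        "type": "BLOOM-176B",
        "checkpoints": files,
        "version": 1.0,
        "parallelization": ckpt_type,
    }
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

# Demo in a throwaway directory: the tp dir is empty, so the type falls back to "pp".
with tempfile.TemporaryDirectory() as tmp:
    out = write_checkpoints_json(
        os.path.join(tmp, "checkpoints.json"), tmp, ["shard-0.bin"]  # placeholder name
    )
    print(out["parallelization"])  # pp
```

On a later run, once tp-sharded files exist in tp_dir, the same call would record "tp" instead.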

I will make it all configurable once the dust settles.

Jul 26 '22 21:07 stas00

Hi @asaparov

It's great to see your issue is solved. As @stas00 mentioned, the part regarding the new checkpoint loading is coming soon too. @stas00, thanks for the full details here :)

Best, Reza

Jul 27 '22 00:07 RezaYazdaniAminabadi

@stas00 Actually I just tested both bigscience/bloom and bigscience/bloom-1b3 without CUDA_LAUNCH_BLOCKING=1 and they both work. This is probably because I pulled newer code from the bloom-inference branch of this repo (commit b76e516) and the code from the ds-inference/bloom-fix branch of DeepSpeed (commit f39c78f).

I had to fix a few bugs related to save_mp_checkpoint_path being set to False instead of None, but everything seems to work fine after that.

@asaparov Can you share your code for running BLOOM inference, or tell me which inference repo you used and whether you made any code modifications? I have the same hardware setup as yours, but I can't get rid of the CUDA errors even after adding CUDA_LAUNCH_BLOCKING=1. I used the inference code on the bloom-inference branch and the ds-inference/bloom-fix branch of DeepSpeed. Also, did you set the environment variable WORLD_SIZE?

Jul 27 '22 04:07 pai4451

@pai4451 I didn't change any code from this repo at all. I followed the installation instructions in the readme. I invoke the inference script using: deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
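
For readers reproducing the command above, the hostfile passed via --hostfile lists one node per line in the form `<hostname> slots=<gpus>`. The snippet below is a minimal sketch that renders such a file; make_hostfile is a hypothetical helper, and the node names and slot count are assumptions taken from the log prefixes and the 4-GPU-per-node setup described in this thread, not the actual file used.

```python
def make_hostfile(hosts, slots_per_host):
    """Render DeepSpeed hostfile text: one `<hostname> slots=<n>` line per node."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


# Assumed node names (from the log prefixes in this thread), 4 GPUs per node.
print(make_hostfile(["gr061", "gr062", "gr063", "gr064"], 4), end="")
```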

I'm running everything in a conda environment in a singularity container. The output of conda list is:

Singularity> conda list
# packages in environment at /ext3/miniconda3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
absl-py                   1.2.0                    pypi_0    pypi
aiohttp                   3.8.1                    pypi_0    pypi
aiosignal                 1.2.0                    pypi_0    pypi
apex                      0.1                      pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
attrs                     21.4.0                   pypi_0    pypi
black                     21.4b0                   pypi_0    pypi
blas                      2.115                       mkl    conda-forge
blas-devel                3.9.0            15_linux64_mkl    conda-forge
brotlipy                  0.7.0           py39hb9d737c_1004    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.6.15            ha878542_0    conda-forge
cachetools                5.2.0                    pypi_0    pypi
certifi                   2022.6.15        py39hf3d152e_0    conda-forge
cffi                      1.15.1           py39he91dace_0    conda-forge
charset-normalizer        2.1.0              pyhd8ed1ab_0    conda-forge
click                     8.1.3                    pypi_0    pypi
colorama                  0.4.5              pyhd8ed1ab_0    conda-forge
conda                     4.13.0           py39hf3d152e_1    conda-forge
conda-package-handling    1.8.1            py39hb9d737c_1    conda-forge
cryptography              37.0.4           py39hd97740a_0    conda-forge
cudatoolkit               11.6.0              hecad31d_10    conda-forge
datasets                  2.4.0                    pypi_0    pypi
deepspeed                 0.7.0+f39c78f9            dev_0    <develop>
dill                      0.3.5.1                  pypi_0    pypi
filelock                  3.7.1                    pypi_0    pypi
frozenlist                1.3.0                    pypi_0    pypi
fsspec                    2022.5.0                 pypi_0    pypi
google-auth               2.9.1                    pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.47.0                   pypi_0    pypi
hjson                     3.0.2                    pypi_0    pypi
huggingface-hub           0.8.1                    pypi_0    pypi
idna                      3.3                pyhd8ed1ab_0    conda-forge
importlib-metadata        4.12.0                   pypi_0    pypi
isort                     5.10.1                   pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libaio                    0.3.113              h5eee18b_0    <unknown>
libblas                   3.9.0            15_linux64_mkl    conda-forge
libcblas                  3.9.0            15_linux64_mkl    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
libgfortran-ng            12.1.0              h69a702a_16    conda-forge
libgfortran5              12.1.0              hdcd56e2_16    conda-forge
libgomp                   12.1.0              h8d9b700_16    conda-forge
liblapack                 3.9.0            15_linux64_mkl    conda-forge
liblapacke                3.9.0            15_linux64_mkl    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.12               h166bdaf_2    conda-forge
llvm-openmp               14.0.4               he0ac6c6_0    conda-forge
markdown                  3.4.1                    pypi_0    pypi
markupsafe                2.1.1                    pypi_0    pypi
mkl                       2022.1.0           h84fe81f_915    conda-forge
mkl-devel                 2022.1.0           ha770c72_916    conda-forge
mkl-include               2022.1.0           h84fe81f_915    conda-forge
multidict                 6.0.2                    pypi_0    pypi
multiprocess              0.70.13                  pypi_0    pypi
mypy-extensions           0.4.3                    pypi_0    pypi
ncurses                   6.3                  h27087fc_1    conda-forge
ninja                     1.10.2.3                 pypi_0    pypi
nltk                      3.7                      pypi_0    pypi
numpy                     1.23.1                   pypi_0    pypi
oauthlib                  3.2.0                    pypi_0    pypi
openssl                   1.1.1q               h166bdaf_0    conda-forge
packaging                 21.3                     pypi_0    pypi
pandas                    1.4.3                    pypi_0    pypi
parameterized             0.8.1                    pypi_0    pypi
pathspec                  0.9.0                    pypi_0    pypi
pip                       22.2               pyhd8ed1ab_0    conda-forge
protobuf                  3.19.4                   pypi_0    pypi
psutil                    5.9.1                    pypi_0    pypi
py-cpuinfo                8.0.0                    pypi_0    pypi
pyarrow                   8.0.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pybind11                  2.10.0                   pypi_0    pypi
pycosat                   0.6.3           py39hb9d737c_1010    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pydantic                  1.9.1                    pypi_0    pypi
pyopenssl                 22.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1            py39hf3d152e_5    conda-forge
python                    3.9.13          h9a8a25e_0_cpython    conda-forge
python-dateutil           2.8.2                    pypi_0    pypi
python_abi                3.9                      2_cp39    conda-forge
pytorch                   1.12.0          py3.9_cuda11.6_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2022.1                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.1.2                h0f457ee_0    conda-forge
regex                     2022.7.25                pypi_0    pypi
requests                  2.28.1             pyhd8ed1ab_0    conda-forge
requests-oauthlib         1.3.1                    pypi_0    pypi
responses                 0.18.0                   pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
ruamel_yaml               0.15.80         py39hb9d737c_1007    conda-forge
setuptools                63.2.0           py39hf3d152e_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.39.2               h4ff8645_0    conda-forge
tbb                       2021.5.0             h924138e_1    conda-forge
tensorboard               2.9.1                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tk                        8.6.12               h27826a3_0    conda-forge
tokenizers                0.12.1                   pypi_0    pypi
toml                      0.10.2                   pypi_0    pypi
tqdm                      4.64.0             pyhd8ed1ab_0    conda-forge
transformers              4.20.1                   pypi_0    pypi
typing_extensions         4.3.0              pyha770c72_0    conda-forge
tzdata                    2022a                h191b570_0    conda-forge
urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
werkzeug                  2.2.0                    pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xxhash                    3.0.0                    pypi_0    pypi
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
yarl                      1.7.2                    pypi_0    pypi
zipp                      3.8.1                    pypi_0    pypi
zlib                      1.2.12               h166bdaf_2    conda-forge

For this repo and deepspeed, I'm using the commits that I mention above. I had a few errors from deepspeed complaining about save_mp_checkpoint_path which I fixed with the following changes:

diff --git a/deepspeed/__init__.py b/deepspeed/__init__.py
index 655d7a96..50049a2a 100755
--- a/deepspeed/__init__.py
+++ b/deepspeed/__init__.py
@@ -239,7 +239,7 @@ def init_inference(model,
                    moe_type='standard',
                    args=None,
                    enable_cuda_graph=False,
-                   save_mp_checkpoint_path=False):
+                   save_mp_checkpoint_path=None):
     """Initialize the DeepSpeed InferenceEngine.

     Arguments:
diff --git a/deepspeed/inference/engine.py b/deepspeed/inference/engine.py
index b5841dab..f380cd21 100755
--- a/deepspeed/inference/engine.py
+++ b/deepspeed/inference/engine.py
@@ -50,7 +50,7 @@ class InferenceEngine(Module):
                  moe_type='standard',
                  config=None,
                  enable_cuda_graph=False,
-                 save_mp_checkpoint_path=False):
+                 save_mp_checkpoint_path=None):
         """
         Args:
             model: torch.nn.Module
@@ -322,7 +322,7 @@ class InferenceEngine(Module):
                                 moe_type='standard',
                                 training_mp_size=1,
                                 checkpoint_dir=None,
-                                save_mp_checkpoint_path=False):
+                                save_mp_checkpoint_path=None):
         checkpoint, ckpt_type = SDLoaderFactory.get_sd_loader_json(
             checkpoint_dir) if checkpoint_dir is not None else (None, None)
         replace_transformer_layer(client_module,
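
Why the default value matters can be illustrated with a toy function. This is only a sketch of the general pitfall, assuming the library guards the save step with an `is not None` check; the actual DeepSpeed code path may differ, and maybe_save_path and the example path are hypothetical.

```python
def maybe_save_path(save_mp_checkpoint_path=None):
    """Return the directory to save to, or None when saving is disabled.

    Hypothetical stand-in for a guard of the form `if path is not None:`.
    """
    if save_mp_checkpoint_path is not None:
        # A value of False passes this check by mistake and would then
        # be treated as if it were a path.
        return save_mp_checkpoint_path
    return None


print(maybe_save_path())                 # None: saving disabled
print(maybe_save_path(False))            # False: the guard did not filter it out
print(maybe_save_path("/tmp/bloom-tp"))  # /tmp/bloom-tp (example path)
```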

I also had to make a few other edits to deepspeed since I wanted each worker to run within the singularity container, and to prevent ssh from complaining about host key authentication (I'm running this on a cluster).

Jul 27 '22 05:07 asaparov

@asaparov Thanks for the details. I can finally run BLOOM inference with DeepSpeed on multiple nodes now. However, it only works for batch_size=1; when I increase the batch size, the error RuntimeError: CUDA error: an illegal memory access was encountered is thrown again. Do you have the same issue, or can you run inference with a batch size greater than 1 on your side? Thank you.

Jul 27 '22 07:07 pai4451

Hmm, it's not working for me even within a single node with batch size = 1, on 8x A100 80GB. Same CUDA illegal memory access error.

Jul 27 '22 08:07 mayank31398

Hmm, it's not working for me even within a single node with batch size = 1, on 8x A100 80GB. Same CUDA illegal memory access error.

Check whether "NCCL WARN Call to ibv_reg_mr failed" appears in your log. In our case, we modified /etc/security/limits.conf to resolve it. You can find details here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html

But it also only works for batch size == 1.
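
As far as I understand the NCCL troubleshooting page linked above, this warning is usually tied to the locked-memory (memlock) limit, which is what the limits.conf change raises. A quick way to see the limit a process actually inherits is sketched below using Python's standard resource module; describe_memlock_limit is a hypothetical helper name, and it should be run inside the same container or shell that launches the workers.

```python
import resource


def describe_memlock_limit():
    """Return the soft and hard RLIMIT_MEMLOCK values as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY means no limit is enforced.
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = describe_memlock_limit()
    print(f"memlock soft limit: {soft}")
    print(f"memlock hard limit: {hard}")
```

If either value is a small byte count rather than "unlimited", the limit has probably not been picked up by the session that starts the job.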

Jul 27 '22 09:07 pohunghuang-nctu

@pohunghuang-nctu Nothing like that in my logs. This is the full trace:

[2022-07-26 11:41:08,472] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-26 11:41:11,508] [INFO] [runner.py:457:main] cmd = /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --benchmark
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_IB_DISABLE=1
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_DEBUG=INFO
[2022-07-26 11:41:12,431] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-07-26 11:41:12,431] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-07-26 11:41:12,431] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-07-26 11:41:12,431] [INFO] [launch.py:123:main] dist_world_size=8
[2022-07-26 11:41:12,431] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2022-07-26 11:41:13,715] [INFO] [comm.py:423:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom
[2022-07-26 11:41:22,608] [INFO] [utils.py:827:see_memory_usage] pre-from-pretrained
[2022-07-26 11:41:22,608] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:22,608] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.2 GB, percent = 0.9%
[2022-07-26 11:41:22,745] [INFO] [utils.py:827:see_memory_usage] post-from-pretrained
[2022-07-26 11:41:22,746] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:22,746] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.21 GB, percent = 0.9%
[2022-07-26 11:41:22,795] [INFO] [utils.py:827:see_memory_usage] post-init-ds-zero-init
[2022-07-26 11:41:22,795] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:22,796] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.27 GB, percent = 0.9%
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Using network Socket
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Using network Socket
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Using network Socket
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Using network Socket
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all rings
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all rings
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all rings
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all rings
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all trees
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all trees
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all trees
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all trees
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all trees
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all trees
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all trees
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all trees
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO comm 0x7f6890002fb0 rank 1 nranks 8 cudaDev 1 busId 4080 - Init COMPLETE
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO comm 0x7fbcc4002fb0 rank 4 nranks 8 cudaDev 4 busId 40b0 - Init COMPLETE
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO comm 0x7f0b9c002fb0 rank 2 nranks 8 cudaDev 2 busId 4090 - Init COMPLETE
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO comm 0x7f09a0002fb0 rank 6 nranks 8 cudaDev 6 busId 40d0 - Init COMPLETE
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO comm 0x7f61d0002fb0 rank 3 nranks 8 cudaDev 3 busId 40a0 - Init COMPLETE
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO comm 0x7fbd04002fb0 rank 0 nranks 8 cudaDev 0 busId 4070 - Init COMPLETE
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Launch mode Parallel
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO comm 0x7f03dc002fb0 rank 5 nranks 8 cudaDev 5 busId 40c0 - Init COMPLETE
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO comm 0x7f1000002fb0 rank 7 nranks 8 cudaDev 7 busId 40e0 - Init COMPLETE
[2022-07-26 11:41:29,495] [INFO] [utils.py:827:see_memory_usage] pre-ds-inference-init
[2022-07-26 11:41:29,495] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:29,496] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 19.92 GB, percent = 1.6%
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.7.0+b6305d0e, git-hash=b6305d0e, git-branch=master
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25245213508605957 seconds
[2022-07-26 11:41:30,151] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True}
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2497098445892334 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2436366081237793 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24797964096069336 seconds
Time to load transformer_inference op: 0.24489784240722656 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2467021942138672 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24748826026916504 seconds
Time to load transformer_inference op: 0.24941658973693848 seconds
Loading 72 checkpoint shards:   0%|          | 0/72 [11:08<?, ?it/s]
[2022-07-26 11:52:39,789] [INFO] [engine.py:145:__init__] Place model to device: 6
Loading 72 checkpoint shards:   0%|          | 0/72 [11:09<?, ?it/s]
[2022-07-26 11:52:39,989] [INFO] [engine.py:145:__init__] Place model to device: 1
Loading 72 checkpoint shards:   0%|          | 0/72 [11:10<?, ?it/s]
[2022-07-26 11:52:41,127] [INFO] [engine.py:145:__init__] Place model to device: 3
Loading 72 checkpoint shards:   0%|          | 0/72 [11:14<?, ?it/s]
[2022-07-26 11:52:45,432] [INFO] [engine.py:145:__init__] Place model to device: 5
Loading 72 checkpoint shards:   0%|          | 0/72 [11:22<?, ?it/s]
[2022-07-26 11:52:53,353] [INFO] [engine.py:145:__init__] Place model to device: 7
Loading 72 checkpoint shards:   0%|          | 0/72 [11:24<?, ?it/s]
[2022-07-26 11:52:55,107] [INFO] [engine.py:145:__init__] Place model to device: 2
Loading 72 checkpoint shards: 100%|██████████| 72/72 [11:24<00:00,  9.51s/it]
[2022-07-26 11:52:55,582] [INFO] [engine.py:145:__init__] Place model to device: 0
[2022-07-26 11:52:55,707] [INFO] [utils.py:827:see_memory_usage] post-ds-inference-init
[2022-07-26 11:52:55,708] [INFO] [utils.py:828:see_memory_usage] MA 47.04 GB         Max_MA 47.24 GB         CA 47.04 GB         Max_CA 47 GB
[2022-07-26 11:52:55,709] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 25.77 GB, percent = 2.0%
*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': False}
Loading 72 checkpoint shards:   0%|          | 0/72 [11:25<?, ?it/s]
[2022-07-26 11:52:56,613] [INFO] [engine.py:145:__init__] Place model to device: 4
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6

llm-test-cluster-9:1281342:1283501 [1] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281344:1283502 [3] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281343:1283503 [2] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281347:1283504 [6] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281346:1283505 [5] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281348:1283506 [7] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281345:1283507 [4] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281341:1283500 [0] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

Jul 27 '22 09:07 mayank31398

I get the same error for batch size > 1, even with CUDA_LAUNCH_BLOCKING=1:

gr062: RuntimeError: CUDA error: an illegal memory access was encountered
gr062: terminate called after throwing an instance of 'c10::CUDAError'
gr062:   what():  CUDA error: an illegal memory access was encountered
gr062: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr062: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7ad7777477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #1: <unknown function> + 0x1d4a3 (0x7f7b04d684a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f7b04d6e417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #3: <unknown function> + 0x458c68 (0x7f7b1755cc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f7ad775ad95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #5: <unknown function> + 0x34db35 (0x7f7b17451b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #6: <unknown function> + 0x681fc8 (0x7f7b17785fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f7b177862c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #8: <unknown function> + 0x127e28 (0x55bbd032ae28 in /ext3/miniconda3/bin/python3.9)
gr062: frame #9: <unknown function> + 0x134ad8 (0x55bbd0337ad8 in /ext3/miniconda3/bin/python3.9)
gr062: frame #10: <unknown function> + 0x1487ce (0x55bbd034b7ce in /ext3/miniconda3/bin/python3.9)
gr062: frame #11: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #12: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #13: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #14: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #15: <unknown function> + 0x11c661 (0x55bbd031f661 in /ext3/miniconda3/bin/python3.9)
gr062: frame #16: PyDict_SetItemString + 0x4a (0x55bbd032581a in /ext3/miniconda3/bin/python3.9)
gr062: frame #17: <unknown function> + 0x214aec (0x55bbd0417aec in /ext3/miniconda3/bin/python3.9)
gr062: frame #18: Py_FinalizeEx + 0x186 (0x55bbd0416f56 in /ext3/miniconda3/bin/python3.9)
gr062: frame #19: Py_RunMain + 0x112 (0x55bbd040a2b2 in /ext3/miniconda3/bin/python3.9)
gr062: frame #20: Py_BytesMain + 0x39 (0x55bbd03dcb79 in /ext3/miniconda3/bin/python3.9)
gr062: frame #21: __libc_start_main + 0xf3 (0x7f7b5cb060b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr062: frame #22: <unknown function> + 0x1d9a81 (0x55bbd03dca81 in /ext3/miniconda3/bin/python3.9)

@stas00 @RezaYazdaniAminabadi

Jul 27 '22 13:07 asaparov

I get the same error for batch size > 1:

@asaparov Okay, at least this is reproducible, thanks.

Jul 27 '22 13:07 pai4451

I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?

Jul 27 '22 13:07 mayank31398

I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?

What are your CUDA and DeepSpeed versions? I have CUDA 11.5 and DeepSpeed 0.7.0 installed from the ds-inference/bloom-fix branch, and I can run BLOOM inference with batch size 1 on two nodes.

Jul 27 '22 14:07 pai4451

I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?

What are your CUDA and DeepSpeed versions? I have CUDA 11.5 and DeepSpeed 0.7.0 installed from the ds-inference/bloom-fix branch, and I can run BLOOM inference with batch size 1 on two nodes.

I am using CUDA 11.6, and DeepSpeed is built from master.

Jul 27 '22 14:07 mayank31398

@mayank31398 Perhaps try the ds-inference/bloom-fix branch of deepspeed?

Jul 27 '22 14:07 asaparov

@mayank31398 Perhaps try the ds-inference/bloom-fix branch of deepspeed?

I'll try this today. Thanks.

Jul 27 '22 14:07 mayank31398

Actually, I just tried running with larger batch sizes (16 and 32) and it doesn't run into the "CUDA illegal memory access" error (as I did with batch size=2). Maybe it is intermittent? Or maybe something's wrong with batch size 2 specifically.

Jul 27 '22 19:07 asaparov

Actually, I just tried running with larger batch sizes (16 and 32) and it doesn't run into the "CUDA illegal memory access" error (as I did with batch size=2). Maybe it is intermittent? Or maybe something's wrong with batch size 2 specifically.

We (with @pai4451) tried batch sizes from 2 to 8, and all of them failed, but we have not yet tried batch_size > 8. Pai will test it today to see what happens on our side.

Jul 28 '22 01:07 pohunghuang-nctu

@asaparov I tried the inference script with batch sizes 1, 2, 4, 8, 16, 32, 64, and 128. Only batch sizes 1 and 32 work, which is a bit surprising. Anyway, we'll have to wait for someone to fix the issue in this repo.

Jul 28 '22 02:07 pai4451

Hi all,

There are some new changes merged into DeepSpeed master. Would you mind trying them? I have tried batch sizes 1 and 128, and both work on my side (I ran it on 8 A100 80GB GPUs). I will try on A100-40G as well to make sure all is fine. Also, you can now generate MP-sharded checkpoints to load the model much faster. You can find more information in this PR: https://github.com/microsoft/DeepSpeed/pull/2132

Thanks, Reza

Jul 29 '22 06:07 RezaYazdaniAminabadi

@RezaYazdaniAminabadi could you give a hint (or point to the docs) on how to "generate MP-sharded checkpoints"? So far we have only the 70 .bin files downloaded from Hugging Face. Do you mean there's a tool that re-formats these 70 files into world-size pieces to speed up model loading? Thanks in advance.

Jul 29 '22 09:07 pohunghuang-nctu

Hi @pohunghuang-nctu

Sure, you need to pass save_mp_checkpoint_path to the init_inference method in order to save the tp-sharded checkpoints in the path you specified. You will see that after loading the checkpoint, DeepSpeed starts saving the new checkpoints, and you will eventually have the tp-sharded checkpoints. In addition, there will be a json config file saved in that path (like bloom_ds-inference-config.json) that you can pass as the checkpoint argument to init_inference in the next run. Note that you can remove save_mp_checkpoint_path after you save the tp-sharded checkpoints for the first time, so that DeepSpeed doesn't always save a new checkpoint for you.
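
The two-step flow described above can be sketched roughly as follows. This is only an illustration, not the script's actual code: the helper name `sharded_checkpoint_kwargs` is hypothetical, and the default config file name is an assumption taken from the "like bloom_ds-inference-config.json" example, so check what DeepSpeed actually writes into your directory.

```python
import os


def sharded_checkpoint_kwargs(shard_dir, config_name="bloom_ds-inference-config.json"):
    """Hypothetical helper: pick init_inference kwargs for the first vs. later runs.

    config_name is an assumed file name; verify it against the saved files.
    """
    config_path = os.path.join(shard_dir, config_name)
    if os.path.isfile(config_path):
        # Later runs: the tp-sharded checkpoints already exist, so pass the
        # saved JSON config as the `checkpoint` argument and stop re-saving.
        return {"checkpoint": config_path}
    # First run: ask DeepSpeed to write tp-sharded checkpoints into shard_dir.
    # The original checkpoint still has to be supplied separately on this run.
    return {"save_mp_checkpoint_path": shard_dir}


# Intended use (left as a comment because it needs DeepSpeed, GPUs, and a model):
# model = deepspeed.init_inference(model, **other_args,
#                                  **sharded_checkpoint_kwargs("/path/to/shards"))
```

Switching on the presence of the JSON config means the same launch command can be reused: the first run saves the shards, and subsequent runs load them.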

Best, Reza

Jul 29 '22 16:07 RezaYazdaniAminabadi

@RezaYazdaniAminabadi I was testing with the newly merged code last night but still hit illegal memory accesses intermittently at the larger batch sizes. It wasn't random like rolling dice, though: it would work for about half an hour, then stop working for another block of time, and then start working again.

For the first time I was able to use some larger batch sizes though (at least part of the time), so something seems to have improved.

EDIT: these tests were on 8x A100 80GB

Jul 29 '22 17:07 zcrypt0

I am glad you could run it with large batch sizes now! :) I think this might be related to some cache allocation issues. We are working on optimizing that part too.

Jul 29 '22 18:07 RezaYazdaniAminabadi
