Using MegatronCommOverlapCallback(tp_comm_overlap=True) causes a segfault.
Describe the bug
A segmentation fault occurs when MegatronCommOverlapCallback is initialized with tp_comm_overlap=True. This specific configuration is adopted from https://github.com/NVIDIA/NeMo/blob/19fadb67b09ba94c55094d34df119d6f9c565068/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L85.
Steps/Code to reproduce bug
The issue can be reproduced by running llama3_1_8b.py (see below) with the following command:
docker run --gpus all -it --rm -v /home/test_run:/workspace/test_run --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 -e CUDA_DEVICE_MAX_CONNECTIONS=1 nvcr.io/nvidia/nemo:25.04.rc2 /bin/bash -c "python test_run/llama3_1_8b.py"
llama3_1_8b.py:
"""Llama 3.1 8B training recipe."""
import os

from lightning.pytorch.loggers import TensorBoardLogger
from megatron.core.distributed import DistributedDataParallelConfig
from megatron.core.optimizer import OptimizerConfig
from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm.gpt.model.llama import Llama31Config8B
from nemo.collections.llm.gpt.model.llama import LlamaModel
from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
from nemo.lightning.pytorch.optim import CosineAnnealingScheduler
from nemo.lightning.pytorch.optim import MegatronOptimizerModule
from nemo.utils.exp_manager import TimingCallback
import torch


def main():
    data = llm.MockDataModule(
        num_train_samples=1_000_000,
        seq_length=8192,
        global_batch_size=128,
        micro_batch_size=1,
    )
    model_config = Llama31Config8B()
    model = LlamaModel(model_config)
    strategy = nl.MegatronStrategy(
        tensor_model_parallel_size=8,
        pipeline_model_parallel_size=1,
        pipeline_dtype=torch.bfloat16,
        virtual_pipeline_model_parallel_size=None,
        context_parallel_size=1,
        expert_model_parallel_size=1,
        sequence_parallel=True,
        account_for_embedding_in_pipeline_split=True,
        account_for_loss_in_pipeline_split=True,
        gradient_as_bucket_view=True,
        ckpt_async_save=True,
        ckpt_parallel_save=True,
        ckpt_parallel_load=True,
        ckpt_parallel_save_optim=True,
        ckpt_load_strictness="log_all",
        ddp=DistributedDataParallelConfig(
            check_for_nan_in_grad=True,
            grad_reduce_in_fp32=True,
            overlap_grad_reduce=True,
            overlap_param_gather=True,
            average_in_collective=True,
        ),
    )
    # Combine to the trainer
    trainer = nl.Trainer(
        accelerator="gpu",
        devices=8,
        num_nodes=1,
        max_steps=10,
        limit_val_batches=1,
        val_check_interval=5,
        log_every_n_steps=1,
        strategy=strategy,
        # Will let nemo tune automatically
        accumulate_grad_batches=1,
        # Will use nemo's sampler
        use_distributed_sampler=False,
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
        # Will let NeMoLogger to setup checkpoint
        enable_checkpointing=False,
        callbacks=[TimingCallback(),
                   MegatronCommOverlapCallback(tp_comm_overlap=True)],
    )
    # Config the optimizer
    opt_config = OptimizerConfig(
        optimizer="adam",
        lr=3e-4,
        weight_decay=0.1,
        bf16=True,
        fp16=False,
        adam_beta1=0.9,
        adam_beta2=0.95,
        adam_eps=1e-5,
        use_distributed_optimizer=True,
        clip_grad=1.0,
    )
    lr_scheduler = CosineAnnealingScheduler(
        warmup_steps=2000,
        constant_steps=0,
        min_lr=3e-5,
    )
    opt = MegatronOptimizerModule(config=opt_config, lr_scheduler=lr_scheduler)
    # Setup checkpoint and tensorboard for logger
    ckpt = nl.ModelCheckpoint(
        save_top_k=1,
        # Generate a *-last ckpt copy (link) whenever a ckpt is saved.
        # This is required when using auto resume.
        save_last=True,
        # Set to True if the final ckpt will be used by auto resume
        save_optim_on_train_end=False,
        filename="{val_loss:.2f}-{step}-{consumed_samples}",
    )
    tb = TensorBoardLogger(
        save_dir="tensorboard",  # The name of tfevents folder
        name="",  # No need further subfolder
    )
    logger = nl.NeMoLogger(
        # The centralized dir for loggings, tensorboard, checkpoints
        explicit_log_dir="/logs",
        log_global_rank_0_only=True,
        update_logger_directory=True,
        # Remove this argument to disable checkpointing
        ckpt=ckpt,
        tensorboard=tb,
    )
    # Config auto resume
    resume = nl.AutoResume(
        # Force the training to resume from the last ckpt in log_dir if exists
        resume_if_exists=True,
        # Do not raise error if ckpt does not exist
        resume_ignore_no_checkpoint=True,
    )
    # Call nl.trainer.fit
    llm.pretrain(
        model=model,
        data=data,
        trainer=trainer,
        log=logger,
        resume=resume,
        optim=opt,
    )


if __name__ == "__main__":
    main()
The output log:
[NeMo I 2025-05-09 00:15:20 utils:507] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.float32, use_precision_aware_optimizer=False, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=False, optimizer_offload_fraction=0.0, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
[rank3]:[W509 00:15:20.209630345 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:446 :0:446] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid: 446) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f6fc7db6654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f6fc7db684c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f6fc7db6a88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f7030db6330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f702faa6cf0]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f702fa98331]
6 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f702fa98669]
7 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f702fa98e9b]
8 /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f702fad0a8a]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f701c216889]
10 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f702444918b]
11 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f7023b9aadd]
12 /usr/bin/python() [0x58208f]
13 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
14 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15 /usr/bin/python() [0x54cccd]
16 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 /usr/bin/python() [0x54cccd]
18 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19 /usr/bin/python() [0x54cd94]
20 /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
21 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22 /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
23 /usr/bin/python() [0x608b42]
24 /usr/bin/python() [0x6b4e93]
25 /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26 /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27 /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
28 /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
29 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f7030d9b1ca]
30 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f7030d9b28b]
31 /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank2]:[W509 00:15:20.228850457 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:445 :0:445] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 445) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f499465e654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f499465e84c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f499465ea88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f49fd5fe330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f49fc2f9148]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7f49fc2f9a29]
6 /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f49fc321755]
7 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7f49e8a1685f]
8 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f49f0c4918b]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f49f039aadd]
10 /usr/bin/python() [0x58208f]
11 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13 /usr/bin/python() [0x54cccd]
14 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15 /usr/bin/python() [0x54cccd]
16 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 /usr/bin/python() [0x54cd94]
18 /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20 /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21 /usr/bin/python() [0x608b42]
22 /usr/bin/python() [0x6b4e93]
23 /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24 /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25 /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26 /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f49fd5e31ca]
28 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f49fd5e328b]
29 /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank1]:[W509 00:15:20.233568550 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:444 :0:444] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[rank7]:[W509 00:15:20.233711786 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:450 :0:450] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[rank5]:[W509 00:15:20.234094604 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:448 :0:448] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x100000008)
==== backtrace (tid: 444) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc9bfe5e654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7fc9bfe5e84c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7fc9bfe5ea88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7fca28dd3330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7fca27ace148]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7fca27acea29]
6 /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7fca27af6755]
7 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7fca1421685f]
8 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7fca1c44918b]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7fca1bb9aadd]
10 /usr/bin/python() [0x58208f]
11 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13 /usr/bin/python() [0x54cccd]
14 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15 /usr/bin/python() [0x54cccd]
16 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 /usr/bin/python() [0x54cd94]
18 /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20 /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21 /usr/bin/python() [0x608b42]
22 /usr/bin/python() [0x6b4e93]
23 /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24 /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25 /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26 /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7fca28db81ca]
28 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7fca28db828b]
29 /usr/bin/python(_start+0x25) [0x657ce5]
=================================
==== backtrace (tid: 450) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5149708654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f514970884c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f5149708a88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f5155697330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f5154387cf0]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f5154379331]
6 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f5154379669]
7 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f5154379e9b]
8 /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f51543b1a8a]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f5140afb889]
10 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f5148d2e18b]
11 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f514847fadd]
12 /usr/bin/python() [0x58208f]
13 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
14 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15 /usr/bin/python() [0x54cccd]
16 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 /usr/bin/python() [0x54cccd]
18 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19 /usr/bin/python() [0x54cd94]
20 /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
21 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22 /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
23 /usr/bin/python() [0x608b42]
24 /usr/bin/python() [0x6b4e93]
25 /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26 /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27 /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
28 /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
29 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f515567c1ca]
30 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f515567c28b]
31 /usr/bin/python(_start+0x25) [0x657ce5]
=================================
==== backtrace (tid: 448) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f8c0665e654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f8c0665e84c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f8c0665ea88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f8c6f5cf330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f8c6e2ca148]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7f8c6e2caa29]
6 /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f8c6e2f2755]
7 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7f8c5aa1685f]
8 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f8c62c4918b]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f8c6239aadd]
10 /usr/bin/python() [0x58208f]
11 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13 /usr/bin/python() [0x54cccd]
14 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15 /usr/bin/python() [0x54cccd]
16 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 /usr/bin/python() [0x54cd94]
18 /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20 /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21 /usr/bin/python() [0x608b42]
22 /usr/bin/python() [0x6b4e93]
23 /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24 /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25 /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26 /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f8c6f5b41ca]
28 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f8c6f5b428b]
29 /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank6]:[W509 00:15:20.246922855 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:449 :0:449] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid: 449) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f100765e654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f100765e84c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f100765ea88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f1070652330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f106f342cf0]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f106f334331]
6 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f106f334669]
7 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f106f334e9b]
8 /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f106f36ca8a]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f105ba16889]
10 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f1063c4918b]
11 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f106339aadd]
12 /usr/bin/python() [0x58208f]
13 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
14 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15 /usr/bin/python() [0x54cccd]
16 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 /usr/bin/python() [0x54cccd]
18 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19 /usr/bin/python() [0x54cd94]
20 /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
21 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22 /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
23 /usr/bin/python() [0x608b42]
24 /usr/bin/python() [0x6b4e93]
25 /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26 /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27 /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
28 /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
29 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f10706371ca]
30 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f107063728b]
31 /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank4]:[W509 00:15:20.248572833 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:447 :0:447] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 447) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f29b2e5e654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f29b2e5e84c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f29b2e5ea88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f2a1be49330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f2a1ab44148]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7f2a1ab44a29]
6 /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f2a1ab6c755]
7 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7f2a0721685f]
8 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f2a0f44918b]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f2a0eb9aadd]
10 /usr/bin/python() [0x58208f]
11 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13 /usr/bin/python() [0x54cccd]
14 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15 /usr/bin/python() [0x54cccd]
16 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 /usr/bin/python() [0x54cd94]
18 /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20 /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21 /usr/bin/python() [0x608b42]
22 /usr/bin/python() [0x6b4e93]
23 /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24 /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25 /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26 /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f2a1be2e1ca]
28 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f2a1be2e28b]
29 /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[NeMo W 2025-05-09 00:15:21 nemo_logging:405] Tensor parallel overlap: No overlap config provided. Initializing TP comm overlap with the default config.
[rank0]:[W509 00:15:21.793312440 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:1 :0:1] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid: 1) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f021bd83654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f021bd8384c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f021bd83a88]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f021e064330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f021cd54cf0]
5 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f021cd46331]
6 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f021cd46669]
7 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f021cd46e9b]
8 /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f021cd7ea8a]
9 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f0209416889]
10 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f021164918b]
11 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f0210d9aadd]
12 python() [0x58208f]
13 python(_PyObject_MakeTpCall+0x75) [0x549185]
14 python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15 python() [0x54cccd]
16 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17 python() [0x54cccd]
18 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19 python() [0x54cd94]
20 python(PyObject_Call+0x115) [0x54b3b5]
21 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22 python(PyEval_EvalCode+0x15b) [0x5d58eb]
23 python() [0x608b42]
24 python() [0x6b4e93]
25 python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26 python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27 python(Py_RunMain+0x3b5) [0x6bca95]
28 python(Py_BytesMain+0x2d) [0x6bc57d]
29 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f021e0491ca]
30 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f021e04928b]
31 python(_start+0x25) [0x657ce5]
Expected behavior
Training should complete successfully with tp_comm_overlap=True, as it does when tp_comm_overlap is set to False in MegatronCommOverlapCallback. Instead, setting tp_comm_overlap=True causes a segfault.
Environment overview
Running in NeMo's Docker container (version 25.04.rc2) on GCP A3 High.
It looks like an MPI bootstrap issue. This code path worked previously, so I'm not sure what changed. We can probably just switch to the NCCL or Gloo bootstrap.
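To check the MPI bootstrap hypothesis against the log, one option is to extract the innermost libmpi frame from each backtrace. The sketch below is a hypothetical triage helper, not part of NeMo: `first_mpi_frames`, `FRAME_RE`, and `SAMPLE_LOG` are names introduced here for illustration, and the sample is a short excerpt of two backtraces from the output above.

```python
import re
from collections import Counter

# Matches a symbolized backtrace frame such as
# "4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f70...]"
# and captures the library path and the symbol name.
FRAME_RE = re.compile(r"^\s*\d+\s+(\S+?)\((\w+)\+0x[0-9a-f]+\)")

# Excerpt of two backtraces copied from the log in this issue.
SAMPLE_LOG = """\
==== backtrace (tid: 446) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f6fc7db6654]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f7030db6330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f702faa6cf0]
8 /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f702fad0a8a]
=================================
==== backtrace (tid: 445) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f499465e654]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f49fd5fe330]
4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f49fc2f9148]
6 /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f49fc321755]
=================================
"""


def first_mpi_frames(log_text):
    """Count the innermost libmpi symbol of each backtrace in the log."""
    counts = Counter()
    in_trace = False
    for line in log_text.splitlines():
        if line.startswith("==== backtrace"):
            in_trace = True  # a new backtrace begins
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line)
        if match and "libmpi" in match.group(1):
            counts[match.group(2)] += 1
            in_trace = False  # keep only the innermost MPI frame per trace
    return counts


if __name__ == "__main__":
    for symbol, count in first_mpi_frames(SAMPLE_LOG).items():
        print(symbol, count)
```

In the full log above, four backtraces fault in ompi_dpm_mark_dyncomm (under MPI_Comm_create) and four in ompi_group_increment_proc_count (under PMPI_Group_incl). All eight are reached from c10d::ProcessGroupMPI::createProcessGroupMPI, which is consistent with the fault being in MPI process-group creation rather than in the training step itself.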
ai generated, please verify
The segmentation fault occurs because of a conflict in MPI initialization when using MegatronCommOverlapCallback with tp_comm_overlap=True. The error happens during tensor parallel communication setup when the default MPI bootstrap backend tries to create a new MPI process group, but MPI was already initialized elsewhere in the system.
To fix this issue, specify a different bootstrap backend when initializing MegatronCommOverlapCallback. Use either NCCL or Gloo instead of the default MPI backend:
callbacks=[TimingCallback(),
           MegatronCommOverlapCallback(tp_comm_overlap=True, tp_comm_bootstrap_backend="nccl")]
This change instructs the tensor parallel communication overlap setup to bootstrap with NCCL instead of MPI, avoiding the conflict that causes the segmentation fault.
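If the same script has to run on machines where NCCL may not be available, the backend choice could be made explicit. The following is a minimal sketch under stated assumptions: `comm_overlap_kwargs` is a hypothetical helper, not a NeMo API, and it assumes the callback accepts the `tp_comm_bootstrap_backend` argument shown above with the values "nccl" and "gloo" mentioned in this thread.

```python
def comm_overlap_kwargs(nccl_available: bool) -> dict:
    """Build keyword arguments for MegatronCommOverlapCallback (hypothetical helper).

    Prefers the NCCL bootstrap and falls back to Gloo, so the default MPI
    bootstrap path is never taken.
    """
    backend = "nccl" if nccl_available else "gloo"
    return {"tp_comm_overlap": True, "tp_comm_bootstrap_backend": backend}


if __name__ == "__main__":
    # In a real run the flag could come from torch.distributed.is_nccl_available().
    kwargs = comm_overlap_kwargs(nccl_available=True)
    print(kwargs)
    # In the training script this would be passed on as:
    #   MegatronCommOverlapCallback(**kwargs)
```

Keeping the choice in one small function makes it easy to confirm which bootstrap backend a given run used when comparing logs.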