Using MegatronCommOverlapCallback(tp_comm_overlap=True) causes a segfault.

Open jiuqiant opened this issue 8 months ago • 2 comments

Describe the bug

A segmentation fault occurs when MegatronCommOverlapCallback is initialized with tp_comm_overlap=True. This specific configuration is adopted from https://github.com/NVIDIA/NeMo/blob/19fadb67b09ba94c55094d34df119d6f9c565068/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L85.

Steps/Code to reproduce bug

The issue can be reproduced by running llama3_1_8b.py (shown below) with the following command:

docker run --gpus all -it --rm -v  /home/test_run:/workspace/test_run --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 -e CUDA_DEVICE_MAX_CONNECTIONS=1  nvcr.io/nvidia/nemo:25.04.rc2 /bin/bash -c "python test_run/llama3_1_8b.py"

llama3_1_8b.py:

"""Llama 3.1 8B training recipe."""

import os

from lightning.pytorch.loggers import TensorBoardLogger
from megatron.core.distributed import DistributedDataParallelConfig
from megatron.core.optimizer import OptimizerConfig
from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm.gpt.model.llama import Llama31Config8B
from nemo.collections.llm.gpt.model.llama import LlamaModel
from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
from nemo.lightning.pytorch.optim import CosineAnnealingScheduler
from nemo.lightning.pytorch.optim import MegatronOptimizerModule
from nemo.utils.exp_manager import TimingCallback
import torch


def main():
  data = llm.MockDataModule(
      num_train_samples=1_000_000,
      seq_length=8192,
      global_batch_size=128,
      micro_batch_size=1,
  )

  model_config = Llama31Config8B()
  model = LlamaModel(model_config)

  strategy = nl.MegatronStrategy(
      tensor_model_parallel_size=8,
      pipeline_model_parallel_size=1,
      pipeline_dtype=torch.bfloat16,
      virtual_pipeline_model_parallel_size=None,
      context_parallel_size=1,
      expert_model_parallel_size=1,
      sequence_parallel=True,
      account_for_embedding_in_pipeline_split=True,
      account_for_loss_in_pipeline_split=True,
      gradient_as_bucket_view=True,
      ckpt_async_save=True,
      ckpt_parallel_save=True,
      ckpt_parallel_load=True,
      ckpt_parallel_save_optim=True,
      ckpt_load_strictness="log_all",
      ddp=DistributedDataParallelConfig(
          check_for_nan_in_grad=True,
          grad_reduce_in_fp32=True,
          overlap_grad_reduce=True,
          overlap_param_gather=True,
          average_in_collective=True,
      ),
  )

  # Combine everything into the trainer
  trainer = nl.Trainer(
      accelerator="gpu",
      devices=8,
      num_nodes=1,
      max_steps=10,
      limit_val_batches=1,
      val_check_interval=5,
      log_every_n_steps=1,
      strategy=strategy,
      # Will let nemo tune automatically
      accumulate_grad_batches=1,
      # Will use nemo's sampler
      use_distributed_sampler=False,
      plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
      # Let NeMoLogger set up checkpointing
      enable_checkpointing=False,
      callbacks=[TimingCallback(), 
                 MegatronCommOverlapCallback(tp_comm_overlap=True)],
  )

  # Configure the optimizer
  opt_config = OptimizerConfig(
      optimizer="adam",
      lr=3e-4,
      weight_decay=0.1,
      bf16=True,
      fp16=False,
      adam_beta1=0.9,
      adam_beta2=0.95,
      adam_eps=1e-5,
      use_distributed_optimizer=True,
      clip_grad=1.0,
  )
  lr_scheduler = CosineAnnealingScheduler(
      warmup_steps=2000,
      constant_steps=0,
      min_lr=3e-5,
  )
  opt = MegatronOptimizerModule(config=opt_config, lr_scheduler=lr_scheduler)

  # Set up checkpointing and TensorBoard for the logger
  ckpt = nl.ModelCheckpoint(
      save_top_k=1,
      # Generate a *-last ckpt copy (link) whenever a ckpt is saved.
      # This is required when using auto resume.
      save_last=True,
      # Set to True if the final ckpt will be used by auto resume
      save_optim_on_train_end=False,
      filename="{val_loss:.2f}-{step}-{consumed_samples}",
  )
  tb = TensorBoardLogger(
      save_dir="tensorboard",  # The name of tfevents folder
      name="",  # No further subfolder needed
  )
  logger = nl.NeMoLogger(
      # Central directory for logs, TensorBoard, and checkpoints
      explicit_log_dir="/logs",
      log_global_rank_0_only=True,
      update_logger_directory=True,
      # Remove this argument to disable checkpointing
      ckpt=ckpt,
      tensorboard=tb,
  )

  # Configure auto resume
  resume = nl.AutoResume(
      # Force the training to resume from the last ckpt in log_dir if exists
      resume_if_exists=True,
      # Do not raise error if ckpt does not exist
      resume_ignore_no_checkpoint=True,
  )

  # Call nl.trainer.fit
  llm.pretrain(
      model=model,
      data=data,
      trainer=trainer,
      log=logger,
      resume=resume,
      optim=opt,
  )


if __name__ == "__main__":
  main()

The output log:

[NeMo I 2025-05-09 00:15:20 utils:507] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.float32, use_precision_aware_optimizer=False, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=False, optimizer_offload_fraction=0.0, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
[rank3]:[W509 00:15:20.209630345 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:446  :0:446] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid:    446) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f6fc7db6654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f6fc7db684c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f6fc7db6a88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f7030db6330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f702faa6cf0]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f702fa98331]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f702fa98669]
 7  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f702fa98e9b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f702fad0a8a]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f701c216889]
10  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f702444918b]
11  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f7023b9aadd]
12  /usr/bin/python() [0x58208f]
13  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
14  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15  /usr/bin/python() [0x54cccd]
16  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  /usr/bin/python() [0x54cccd]
18  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19  /usr/bin/python() [0x54cd94]
20  /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
21  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22  /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
23  /usr/bin/python() [0x608b42]
24  /usr/bin/python() [0x6b4e93]
25  /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26  /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27  /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
28  /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
29  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f7030d9b1ca]
30  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f7030d9b28b]
31  /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank2]:[W509 00:15:20.228850457 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:445  :0:445] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:    445) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f499465e654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f499465e84c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f499465ea88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f49fd5fe330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f49fc2f9148]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7f49fc2f9a29]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f49fc321755]
 7  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7f49e8a1685f]
 8  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f49f0c4918b]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f49f039aadd]
10  /usr/bin/python() [0x58208f]
11  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13  /usr/bin/python() [0x54cccd]
14  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15  /usr/bin/python() [0x54cccd]
16  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  /usr/bin/python() [0x54cd94]
18  /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20  /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21  /usr/bin/python() [0x608b42]
22  /usr/bin/python() [0x6b4e93]
23  /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24  /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25  /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26  /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f49fd5e31ca]
28  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f49fd5e328b]
29  /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank1]:[W509 00:15:20.233568550 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:444  :0:444] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[rank7]:[W509 00:15:20.233711786 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:450  :0:450] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[rank5]:[W509 00:15:20.234094604 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:448  :0:448] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x100000008)
==== backtrace (tid:    444) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc9bfe5e654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7fc9bfe5e84c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7fc9bfe5ea88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7fca28dd3330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7fca27ace148]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7fca27acea29]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7fca27af6755]
 7  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7fca1421685f]
 8  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7fca1c44918b]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7fca1bb9aadd]
10  /usr/bin/python() [0x58208f]
11  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13  /usr/bin/python() [0x54cccd]
14  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15  /usr/bin/python() [0x54cccd]
16  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  /usr/bin/python() [0x54cd94]
18  /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20  /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21  /usr/bin/python() [0x608b42]
22  /usr/bin/python() [0x6b4e93]
23  /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24  /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25  /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26  /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7fca28db81ca]
28  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7fca28db828b]
29  /usr/bin/python(_start+0x25) [0x657ce5]
=================================
==== backtrace (tid:    450) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5149708654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f514970884c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f5149708a88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f5155697330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f5154387cf0]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f5154379331]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f5154379669]
 7  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f5154379e9b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f51543b1a8a]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f5140afb889]
10  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f5148d2e18b]
11  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f514847fadd]
12  /usr/bin/python() [0x58208f]
13  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
14  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15  /usr/bin/python() [0x54cccd]
16  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  /usr/bin/python() [0x54cccd]
18  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19  /usr/bin/python() [0x54cd94]
20  /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
21  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22  /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
23  /usr/bin/python() [0x608b42]
24  /usr/bin/python() [0x6b4e93]
25  /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26  /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27  /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
28  /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
29  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f515567c1ca]
30  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f515567c28b]
31  /usr/bin/python(_start+0x25) [0x657ce5]
=================================
==== backtrace (tid:    448) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f8c0665e654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f8c0665e84c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f8c0665ea88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f8c6f5cf330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f8c6e2ca148]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7f8c6e2caa29]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f8c6e2f2755]
 7  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7f8c5aa1685f]
 8  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f8c62c4918b]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f8c6239aadd]
10  /usr/bin/python() [0x58208f]
11  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13  /usr/bin/python() [0x54cccd]
14  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15  /usr/bin/python() [0x54cccd]
16  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  /usr/bin/python() [0x54cd94]
18  /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20  /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21  /usr/bin/python() [0x608b42]
22  /usr/bin/python() [0x6b4e93]
23  /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24  /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25  /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26  /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f8c6f5b41ca]
28  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f8c6f5b428b]
29  /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank6]:[W509 00:15:20.246922855 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:449  :0:449] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid:    449) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f100765e654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f100765e84c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f100765ea88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f1070652330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f106f342cf0]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f106f334331]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f106f334669]
 7  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f106f334e9b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f106f36ca8a]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f105ba16889]
10  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f1063c4918b]
11  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f106339aadd]
12  /usr/bin/python() [0x58208f]
13  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
14  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15  /usr/bin/python() [0x54cccd]
16  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  /usr/bin/python() [0x54cccd]
18  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19  /usr/bin/python() [0x54cd94]
20  /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
21  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22  /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
23  /usr/bin/python() [0x608b42]
24  /usr/bin/python() [0x6b4e93]
25  /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26  /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27  /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
28  /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
29  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f10706371ca]
30  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f107063728b]
31  /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[rank4]:[W509 00:15:20.248572833 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:447  :0:447] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:    447) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f29b2e5e654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f29b2e5e84c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f29b2e5ea88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f2a1be49330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f2a1ab44148]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_incl_plist+0xa9) [0x7f2a1ab44a29]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f2a1ab6c755]
 7  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xaf) [0x7f2a0721685f]
 8  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f2a0f44918b]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f2a0eb9aadd]
10  /usr/bin/python() [0x58208f]
11  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
12  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
13  /usr/bin/python() [0x54cccd]
14  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
15  /usr/bin/python() [0x54cccd]
16  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  /usr/bin/python() [0x54cd94]
18  /usr/bin/python(PyObject_Call+0x115) [0x54b3b5]
19  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
20  /usr/bin/python(PyEval_EvalCode+0x15b) [0x5d58eb]
21  /usr/bin/python() [0x608b42]
22  /usr/bin/python() [0x6b4e93]
23  /usr/bin/python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
24  /usr/bin/python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
25  /usr/bin/python(Py_RunMain+0x3b5) [0x6bca95]
26  /usr/bin/python(Py_BytesMain+0x2d) [0x6bc57d]
27  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f2a1be2e1ca]
28  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f2a1be2e28b]
29  /usr/bin/python(_start+0x25) [0x657ce5]
=================================
[NeMo W 2025-05-09 00:15:21 nemo_logging:405] Tensor parallel overlap: No overlap config provided. Initializing TP comm overlap with the default config.
[rank0]:[W509 00:15:21.793312440 ProcessGroupMPI.cpp:255] Warning: MPI was previously initialized. (function operator())
[1de2faca3781:1    :0:1] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid:      1) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f021bd83654]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7f021bd8384c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7f021bd83a88]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f021e064330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f021cd54cf0]
 5  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set_nb+0x391) [0x7f021cd46331]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_set+0x39) [0x7f021cd46669]
 7  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_create+0x25b) [0x7f021cd46e9b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f021cd7ea8a]
 9  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI21createProcessGroupMPIESt6vectorIiSaIiEE+0xd9) [0x7f0209416889]
10  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xc2a18b) [0x7f021164918b]
11  /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37badd) [0x7f0210d9aadd]
12  python() [0x58208f]
13  python(_PyObject_MakeTpCall+0x75) [0x549185]
14  python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
15  python() [0x54cccd]
16  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
17  python() [0x54cccd]
18  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
19  python() [0x54cd94]
20  python(PyObject_Call+0x115) [0x54b3b5]
21  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
22  python(PyEval_EvalCode+0x15b) [0x5d58eb]
23  python() [0x608b42]
24  python() [0x6b4e93]
25  python(_PyRun_SimpleFileObject+0x1aa) [0x6b4bfa]
26  python(_PyRun_AnyFileObject+0x4f) [0x6b4a2f]
27  python(Py_RunMain+0x3b5) [0x6bca95]
28  python(Py_BytesMain+0x2d) [0x6bc57d]
29  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f021e0491ca]
30  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f021e04928b]
31  python(_start+0x25) [0x657ce5]
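
The backtraces above can be summarized by tallying the innermost Open MPI frame of each one. In this log, four ranks fault in ompi_dpm_mark_dyncomm (reached through MPI_Comm_create) and four in ompi_group_increment_proc_count (reached through PMPI_Group_incl), all called from ProcessGroupMPI::createProcessGroupMPI. The script below is a minimal sketch of that triage. The function name, the regular expression, and the shortened SAMPLE_LOG excerpt are illustrative assumptions, not part of NeMo, UCX, or Open MPI, and the parser assumes backtraces from different ranks are not interleaved line by line.

```python
import re
from collections import Counter

# Matches symbolized frames such as:
#   4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f...]
# Frames without a symbol name, e.g. "libucs.so.0(+0x3684c)", do not match.
FRAME_RE = re.compile(
    r"^\s*\d+\s+(?P<module>\S+?)\((?P<symbol>\w+)\+0x[0-9a-f]+\)\s+\[0x[0-9a-f]+\]"
)

# Shortened excerpt of two backtraces from the log above (illustration only).
SAMPLE_LOG = """\
==== backtrace (tid:    446) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f6fc7db6654]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7f7030db6330]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_dpm_mark_dyncomm+0x60) [0x7f702faa6cf0]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Comm_create+0x1a) [0x7f702fad0a8a]
=================================
==== backtrace (tid:    445) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f499465e654]
 4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_group_increment_proc_count+0x48) [0x7f49fc2f9148]
 6  /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Group_incl+0x55) [0x7f49fc321755]
=================================
"""


def first_mpi_frames(log_text):
    """Count the innermost libmpi symbol of every backtrace in log_text."""
    counts = Counter()
    searching = False  # True while inside a backtrace with no libmpi frame yet
    for line in log_text.splitlines():
        if line.startswith("==== backtrace"):
            searching = True
            continue
        if not searching:
            continue
        match = FRAME_RE.match(line)
        if match and "libmpi" in match.group("module"):
            counts[match.group("symbol")] += 1
            searching = False  # ignore outer frames of this backtrace
    return counts


if __name__ == "__main__":
    for symbol, count in first_mpi_frames(SAMPLE_LOG).most_common():
        print(f"{count} backtrace(s) fault first in {symbol}")
```

To run it on a real log, replace SAMPLE_LOG with the captured container output.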

Expected behavior

Training should complete successfully with tp_comm_overlap=True, as it does when tp_comm_overlap is set to False in MegatronCommOverlapCallback. Instead, setting tp_comm_overlap=True causes a segfault.

Environment overview

Running in a NeMo Docker container (version 25.04.rc2) on GCP A3 High.

jiuqiant • May 09 '25 00:05

It looks like an MPI bootstrap issue. This code path worked previously, so I'm not sure what changed. We can probably just switch to the NCCL or Gloo bootstrap.

jiemingz • May 12 '25 13:05

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] • Jun 13 '25 02:06

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] • Jun 21 '25 02:06

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] • Sep 04 '25 02:09

AI generated, please verify.

The segmentation fault appears to stem from MPI initialization when MegatronCommOverlapCallback is used with tp_comm_overlap=True. In the log, every rank prints "MPI was previously initialized" and then crashes inside ProcessGroupMPI::createProcessGroupMPI. This suggests that the tensor-parallel overlap setup bootstraps with the MPI backend and fails while creating a new MPI process group after MPI was already initialized elsewhere.

To work around this, specify a different bootstrap backend when initializing MegatronCommOverlapCallback. Use either NCCL or Gloo instead of the default MPI backend:

callbacks=[TimingCallback(), 
           MegatronCommOverlapCallback(tp_comm_overlap=True, tp_comm_bootstrap_backend="nccl")]

This change tells the tensor-parallel communication overlap setup to bootstrap with NCCL instead of MPI, which should avoid the MPI process group creation that triggers the segmentation fault.
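
If the backend needs to be switched between runs without editing the recipe, the callback arguments can be built from an environment variable. The following is a minimal sketch under stated assumptions, not something NeMo provides: the variable name TP_COMM_BOOTSTRAP_BACKEND and the helper comm_overlap_kwargs are hypothetical, and the accepted values are simply the three backends named in this thread (nccl, gloo, mpi). It does not import NeMo, so it can be checked on any machine.

```python
import os

# Assumed set of bootstrap backends, taken from the ones named in this thread.
ALLOWED_BACKENDS = ("nccl", "gloo", "mpi")


def comm_overlap_kwargs(env=None, default="nccl"):
    """Build keyword arguments for MegatronCommOverlapCallback.

    TP_COMM_BOOTSTRAP_BACKEND is a hypothetical variable name used only in
    this sketch; NeMo itself does not read it.
    """
    env = os.environ if env is None else env
    backend = env.get("TP_COMM_BOOTSTRAP_BACKEND", default).strip().lower()
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(
            f"unsupported bootstrap backend {backend!r}; "
            f"expected one of {ALLOWED_BACKENDS}"
        )
    return {"tp_comm_overlap": True, "tp_comm_bootstrap_backend": backend}
```

In the recipe, the result would be passed as MegatronCommOverlapCallback(**comm_overlap_kwargs()), so an unexpected value fails fast with a ValueError before any process group is created.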

zhenyih • Sep 06 '25 21:09