BUG: system.init_data_model_parallel() Prevents Nsight Systems (nsys) from Tracing GPU Hardware in Distributed Mode
Environment Details
Tutel Version: v0.4 (Installed from @main branch)
PyTorch Version: 2.5
CUDA Driver Version: 550.54.15 (Supports CUDA 12.4)
Nsight Systems CLI Version: 2023.4.4
GPU Configuration: 2x A100 80GB
Observed Behavior
When the distributed benchmark (tutel/examples/helloworld.py) is launched via torch.distributed.run and profiled with nsys, the resulting report fails to collect the GPU Hardware (HW) timelines.
The report successfully traces CPU activity and high-level NCCL API calls.
The timelines for GPU 0 / GPU 1, Kernels (GEMM computation), and Memory [C2C] (All-to-All communication) are missing.
Expected Behavior
The nsys report should display the Kernels and Memory [C2C] hardware timelines running concurrently, so that communication/computation overlap can be analyzed.
Steps to Reproduce (Minimal Example)
The issue is reproducible by launching the original script with required profiling flags:
- Launch Command (Failing Trace)
nsys profile --trace-fork-before-exec=true -o tutel_fail.nsys /path/to/python3 -m torch.distributed.run --nproc-per-node=2 ~/tutel_examples/helloworld.py
- Workaround / Proof of Concept (Successful Trace)
The issue is resolved by replacing Tutel's custom initialization with standard PyTorch initialization.
Change helloworld.py (Approx. Lines 53-56):
The block containing parallel_env = system.init_data_model_parallel(...) must be replaced by standard initialization.
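For reference, the block being replaced looks roughly like the following (paraphrased from the example as I remember it; the exact parallel_env attribute names may differ between versions):
parallel_env = system.init_data_model_parallel(backend='nccl' if args.device == 'cuda' else 'gloo')
dist_rank, dist_world_size, dist_print = parallel_env.global_rank, parallel_env.global_size, parallel_env.dist_print
args.local_rank = parallel_env.local_device.index
The rest of the script only consumes these helper variables, so defining equivalents with standard PyTorch calls, as in the block below, appears to be sufficient.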
# CODE BLOCK TO REPLACE TUTEL'S INIT:
import os
import torch
import torch.distributed as dist
# ... (inside script)
dist.init_process_group(backend='nccl' if args.device == 'cuda' else 'gloo')
local_rank = int(os.environ['LOCAL_RANK'])
device = torch.device(f'cuda:{local_rank}')
torch.cuda.set_device(device)
# Re-create the helper variables that Tutel's parallel_env normally provides
dist_rank = dist.get_rank()
dist_world_size = dist.get_world_size()
dist_print = print
Root Cause Analysis
The problem is a conflict between Tutel's custom initialization and the CUDA Profiling Tools Interface (CUPTI) during multi-process injection.
tutel.system.init_data_model_parallel() executes advanced CUDA/NCCL setup very early in the child process lifetime.
When nsys attempts to hook the hardware via CUPTI, Tutel's non-standard, early initialization appears to interfere with the timing window CUPTI needs to install its hardware sampling probes in the child processes.
The result is a silent failure: nsys believes it has attached, but the low-level hardware tracing mechanism is disabled for the workers.
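If this timing hypothesis is correct, one way to sidestep it (a sketch on my side, not verified against Tutel; run_moe_benchmark is a hypothetical placeholder, and I am assuming nsys's capture-range option behaves the same under torch.distributed.run) is to defer hardware capture until after initialization via the cudaProfilerApi capture range:
# Launch with: nsys profile --capture-range=cudaProfilerApi --trace-fork-before-exec=true ...
# so that CUPTI hardware capture starts only when cudaProfilerStart() is issued below.
import torch
from tutel import system

parallel_env = system.init_data_model_parallel(backend='nccl')  # Tutel init completes before capture
torch.cuda.profiler.start()      # cudaProfilerStart(): nsys begins hardware tracing here
run_moe_benchmark()              # hypothetical placeholder for helloworld's forward/backward loop
torch.cuda.profiler.stop()       # cudaProfilerStop(): end of the capture range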
Suggested Fix
It is recommended to review the implementation of init_data_model_parallel() in tutel/system.py to ensure that custom CUDA operations or stream/group creation do not prematurely interfere with the CUPTI initialization window.
Thank you. This command does not seem to run into the issue:
nsys profile --trace-fork-before-exec=true -o tutel_fail.nsys python3 -m torch.distributed.run --nproc-per-node=2 -m tutel.examples.helloworld
Tutel's initialization still uses torch's native distributed initialization, but it creates sub-groups instead of a single global process group.
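A minimal sketch of that sub-group pattern in plain PyTorch, just to make the description concrete (this is not Tutel's actual code, and the group_size split is an arbitrary example):
import torch.distributed as dist

dist.init_process_group(backend='nccl')              # native global initialization
world, rank = dist.get_world_size(), dist.get_rank()
group_size = 2                                       # arbitrary sub-group size for illustration
# every rank must create every sub-group, in the same order
subgroups = [dist.new_group(list(range(start, start + group_size)))
             for start in range(0, world, group_size)]
my_group = subgroups[rank // group_size]             # collectives use my_group instead of the default group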
Do you mean that nsys doesn't trace the initialization process? For your requirement, could you check whether this one-line change solves your issue?
Please replace `if is_distributed:` with `if False:` at https://github.com/microsoft/Tutel/blob/5d2a24f0fcb6a3012de5fc4744ac2b4cc81b7fcd/tutel/impls/communicate.py#L113
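For reference, the suggested experiment at the linked line amounts to the following before/after (surrounding code elided; intended only as a temporary test):
# Original condition (linked line):
if is_distributed:
    ...
# Temporary test change:
if False:
    ...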