torchtitan
NCCL kernels take longer when composing CUDAGraph with SimpleFSDP
Reported by @BoyuanFeng and @galv in PR https://github.com/pytorch/torchtitan/pull/2050.
Repro instructions:
```shell
# WITHOUT cudagraph
USE_EXPANDABLE_SEGMENTS=False NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

# WITH cudagraph
USE_EXPANDABLE_SEGMENTS=False NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph

# The trace is stored in torchtitan/outputs/profile_trace
```
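To compare NCCL kernel durations between the two runs, the traces can be inspected programmatically. Below is a minimal sketch that sums per-kernel GPU time from a Chrome-trace JSON file such as the ones torch.profiler writes; the assumption that NCCL kernels are identifiable by an "nccl" substring in the event name (e.g. `ncclDevKernel_*`) is a heuristic, and the function name `nccl_kernel_totals` is our own, not part of torchtitan.

```python
import json
from collections import defaultdict

def nccl_kernel_totals(trace_path):
    """Sum durations (microseconds) per NCCL kernel name from a
    Chrome-trace JSON file.

    Assumption: NCCL kernels are the complete events ("ph": "X")
    whose name contains "nccl" (case-insensitive).
    """
    with open(trace_path) as f:
        events = json.load(f).get("traceEvents", [])
    totals = defaultdict(float)
    for ev in events:
        # Complete events ("ph": "X") carry their duration in "dur".
        if ev.get("ph") == "X" and "nccl" in ev.get("name", "").lower():
            totals[ev["name"]] += ev.get("dur", 0.0)
    return dict(totals)
```

Running this over the trace from each repro command and diffing the totals per kernel name should make the CUDAGraph slowdown quantifiable rather than eyeballed in the trace viewer.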