torchtitan
NCCL kernels take longer when composing CUDAGraph with SimpleFSDP
Reported by @BoyuanFeng and @galv in PR https://github.com/pytorch/torchtitan/pull/2050.
Repro instructions:
```shell
# WITHOUT cudagraph
USE_EXPANDABLE_SEGMENTS=False NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

# WITH cudagraph
USE_EXPANDABLE_SEGMENTS=False NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph

# The trace is stored in torchtitan/outputs/profile_trace
```
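To compare NCCL kernel durations between the two runs, the traces can be inspected programmatically. Below is a minimal sketch that sums per-kernel GPU time from a Chrome-trace JSON file such as the ones torch.profiler writes; the assumption that NCCL kernels are identifiable by an "nccl" substring in the event name (e.g. `ncclDevKernel_*`) is a heuristic, and the function name `nccl_kernel_totals` is our own, not part of torchtitan.

```python
import json
from collections import defaultdict

def nccl_kernel_totals(trace_path):
    """Sum durations (microseconds) per NCCL kernel name from a
    Chrome-trace JSON file.

    Assumption: NCCL kernels are the complete events ("ph": "X")
    whose name contains "nccl" (case-insensitive).
    """
    with open(trace_path) as f:
        events = json.load(f).get("traceEvents", [])
    totals = defaultdict(float)
    for ev in events:
        # Complete events ("ph": "X") carry their duration in "dur".
        if ev.get("ph") == "X" and "nccl" in ev.get("name", "").lower():
            totals[ev["name"]] += ev.get("dur", 0.0)
    return dict(totals)
```

Running this over the trace from each repro command and diffing the totals per kernel name should make the CUDAGraph slowdown quantifiable rather than eyeballed in the trace viewer.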