[automodel][draft] Integrate Megatron Custom FSDP2 into NeMo Automodel.
Summary
Status: DRAFT - CFSDP2 source code is tentatively vendored from a Megatron branch for Automodel specifically, until the NeMo-Megatron path works again.
- Integrates custom FSDP2 into NeMo Automodel in close collaboration with @shjwudp.
- The Megatron MR "Torch-Native Automodel Support for CFSDP2": https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/3150 (working branch: jianbinc/custom_fsdp_dtensor_ckpt) needs to be merged before this PR is merged.
TODO
- Test TP and CP with CFSDP2 in Automodel after implementing support for DTensor buffering with CFSDP2.
Collection: `nemo.lightning.pytorch.strategies.fsdp2_strategy`
Changelog
- Added options and utilities to wrap the Automodel in `FSDP`, which shards and communicates optimizer state, gradients, and model parameters using dynamically allocated tensors.
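
For orientation, a minimal sketch of how the new options might be wired up from a training script. The `custom_fsdp` and `fsdp_unit_modules` keyword arguments are placeholders for whatever this PR ultimately exposes on `FSDP2Strategy` and may not match the final API.

```python
# Sketch only: hypothetical strategy configuration for CFSDP2 in Automodel.
import nemo.lightning as nl
from nemo.lightning.pytorch.strategies.fsdp2_strategy import FSDP2Strategy

strategy = FSDP2Strategy(
    data_parallel_size=8,
    tensor_parallel_size=1,
    # Hypothetical CFSDP2 options introduced by this PR; names may differ.
    custom_fsdp=True,
    fsdp_unit_modules=[
        "transformers.models.llama.modeling_llama.LlamaDecoderLayer",
    ],
)

trainer = nl.Trainer(devices=8, num_nodes=1, strategy=strategy)
```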
Usage
- To use CFSDP2, pass the `--cfsdp2` argument and populate `--cfsdp2-unit-modules` with the string class-paths of all layers that should be managed by CFSDP2, e.g. `--cfsdp2-unit-modules transformers.models.llama.modeling_llama.LlamaDecoderLayer` (see the example command and class-path sketch below).
```bash
torchrun --nproc-per-node 8 examples/llm/sft/automodel.py --strategy fsdp2 --num-nodes 1 --devices 8 --dp-size 8 --cp-size 1 --global-batch-size 32 --micro-batch-size 1 --accumulate_grad_batches 4 --lr 3e-6 --seq-length 8192 --max-steps 10000 --log-every-n-steps 1 --limit-val-batches 0.025 --trust-remote-code --attn-implementation flash_attention_2 --use-chunked-ce --cfsdp2 --cfsdp2-unit-modules transformers.models.llama.modeling_llama.LlamaDecoderLayer
```
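
For illustration, this is how string class-paths like the one above are typically resolved into the classes that CFSDP2 treats as sharding units. The helper name and its use inside Automodel are assumptions, not this PR's code.

```python
# Sketch: resolve "pkg.module.ClassName" strings into the actual classes
# that should be managed as CFSDP2 unit modules.
import importlib


def resolve_class_path(class_path: str) -> type:
    """Split 'pkg.module.ClassName' into module path and attribute, then import it."""
    module_name, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


unit_modules = [
    resolve_class_path(p)
    for p in ["transformers.models.llama.modeling_llama.LlamaDecoderLayer"]
]
```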