[automodel][draft] Integrate Megatron Custom FSDP2 into NeMo Automodel.
Summary
Status: DRAFT - CFSDP2 source code is tentatively vendored from a Megatron branch for Automodel specifically, until the NeMo-Megatron path works again.
- Integrates custom FSDP2 into NeMo Automodel in close collaboration with @shjwudp.
- The Megatron MR "Torch-Native Automodel Support for CFSDP2": https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/3150 (working branch: jianbinc/custom_fsdp_dtensor_ckpt) needs to be merged before this PR is merged.
TODO
- Test TP and CP with CFSDP2 in Automodel after implementing support for DTensor buffering with CFSDP2.
Collection: `nemo.lightning.pytorch.strategies.fsdp2_strategy`
Changelog
- Added options and utilities to wrap the Automodel in `FSDP`, which shards and communicates optimizer state, gradients, and model parameters using dynamically allocated tensors.
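
For orientation, a minimal sketch of how the new options might be wired up from a training script. The `custom_fsdp` and `fsdp_unit_modules` keyword arguments are placeholders for whatever this PR ultimately exposes on `FSDP2Strategy` and may not match the final API.

```python
# Sketch only: hypothetical strategy configuration for CFSDP2 in Automodel.
import nemo.lightning as nl
from nemo.lightning.pytorch.strategies.fsdp2_strategy import FSDP2Strategy

strategy = FSDP2Strategy(
    data_parallel_size=8,
    tensor_parallel_size=1,
    # Hypothetical CFSDP2 options introduced by this PR; names may differ.
    custom_fsdp=True,
    fsdp_unit_modules=[
        "transformers.models.llama.modeling_llama.LlamaDecoderLayer",
    ],
)

trainer = nl.Trainer(devices=8, num_nodes=1, strategy=strategy)
```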
Usage
- To use CFSDP2, pass the `--cfsdp2` argument and populate `--cfsdp2-unit-modules` with the string class-paths of all layers that should be managed by CFSDP2, e.g. `--cfsdp2-unit-modules transformers.models.llama.modeling_llama.LlamaDecoderLayer` (see the example command and class-path sketch below).
```bash
torchrun --nproc-per-node 8 examples/llm/sft/automodel.py --strategy fsdp2 --num-nodes 1 --devices 8 --dp-size 8 --cp-size 1 --global-batch-size 32 --micro-batch-size 1 --accumulate_grad_batches 4 --lr 3e-6 --seq-length 8192 --max-steps 10000 --log-every-n-steps 1 --limit-val-batches 0.025 --trust-remote-code --attn-implementation flash_attention_2 --use-chunked-ce --cfsdp2 --cfsdp2-unit-modules transformers.models.llama.modeling_llama.LlamaDecoderLayer
```
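
For illustration, this is how string class-paths like the one above are typically resolved into the classes that CFSDP2 treats as sharding units. The helper name and its use inside Automodel are assumptions, not this PR's code.

```python
# Sketch: resolve "pkg.module.ClassName" strings into the actual classes
# that should be managed as CFSDP2 unit modules.
import importlib


def resolve_class_path(class_path: str) -> type:
    """Split 'pkg.module.ClassName' into module path and attribute, then import it."""
    module_name, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


unit_modules = [
    resolve_class_path(p)
    for p in ["transformers.models.llama.modeling_llama.LlamaDecoderLayer"]
]
```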