[BUGFIX] Save dist_checkpointing metadata on all nodes for multi-node training
## Description
Fixes a bug where `metadata.json` is saved only on global rank 0 during distributed checkpointing, causing load failures on other nodes in non-shared filesystem setups.
### Fix
Changed the save condition to:
```python
# Guard on local rank 0 so that every node writes its own copy of metadata.json.
if int(os.environ.get("LOCAL_RANK", 0)) == 0:
    save_config(...)
```
This ensures `metadata.json` is saved on every node (by its local rank 0), allowing checkpoint loading to succeed on all nodes.
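For context, here is a minimal sketch of the per-node guard, assuming the standard `torchrun` environment variables (`LOCAL_RANK` is the process index within a node, while the global rank is unique across the whole job); this is an illustration, not the exact Megatron code:

```python
import os

def should_save_metadata() -> bool:
    """Return True on exactly one process per node.

    torchrun sets LOCAL_RANK to the process index within its node, so this
    guard fires once per node. The previous behavior (saving only on the
    global rank 0 process) fires once per job, which leaves metadata.json
    missing on every other node when the checkpoint directory is not on a
    shared filesystem.
    """
    return int(os.environ.get("LOCAL_RANK", 0)) == 0
```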
### Testing
Run `dist_cp_save_load.py` with `torchrun` on 2+ nodes (a per-node verification sketch follows the list):
- Confirm each node has its own `metadata.json`
- No `CheckpointingException` occurs
- The final log should show: `Loaded the disk checkpoint.`
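A minimal sketch of the per-node verification step, assuming `torch.distributed` has already been initialized by `torchrun` and using a hypothetical checkpoint directory path:

```python
import os
import torch.distributed as dist

def verify_metadata_per_node(checkpoint_dir: str) -> None:
    """Check from local rank 0 of every node that metadata.json exists locally."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if local_rank == 0:
        metadata_path = os.path.join(checkpoint_dir, "metadata.json")
        assert os.path.isfile(metadata_path), (
            f"metadata.json missing on the node hosting global rank "
            f"{dist.get_rank()}: {metadata_path}"
        )
    # Wait until every node has passed the check before declaring success.
    dist.barrier()

# Example (hypothetical path):
# verify_metadata_per_node("/tmp/checkpoints/iter_0000100")
```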
Fixes #1530
Thanks Pranaykarvi for the quick fix and testing!
I'd also like to learn more from the Megatron team about the design assumptions for this metadata:
- Is the metadata directory assumed to be hosted on a distributed file system (e.g. NFS)?
- Are users responsible for keeping the metadata in sync across nodes?
- Should the application side handle loading the metadata only from the master node rather than from all nodes?
Marking as stale. No activity in 60 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
@sbhavani can we merge this fix? we're also running into this
bump here! would also like this to be merged