
[BUGFIX] Save dist_checkpointing metadata on all nodes for multi-node training

Open Pranaykarvi opened this issue 8 months ago • 5 comments

## Description

Fixes a bug where `metadata.json` is saved only on global rank 0 during distributed checkpointing, causing load failures on other nodes in non-shared filesystem setups.

### Fix

Changed the save condition to:

```python
# Save on local rank 0 of every node, not just global rank 0
if int(os.environ.get("LOCAL_RANK", 0)) == 0:
    save_config(...)
```
This ensures `metadata.json` is saved on each node (by local rank 0), allowing checkpoints to be loaded successfully on every node.
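For context, here is a minimal sketch of the behavior change. The helper name and the barrier are illustrative assumptions (not the actual Megatron-LM code); it only assumes `torchrun` sets `LOCAL_RANK` for each process:

```python
import json
import os

import torch.distributed as dist


def save_checkpoint_metadata(checkpoint_dir, metadata):
    # Old behavior: only global rank 0 writes metadata.json, which breaks
    # loading on other nodes when checkpoints live on node-local storage.
    #   if dist.get_rank() == 0: ...
    #
    # New behavior: the first process on *each* node writes the file, so
    # every node-local checkpoint directory is self-contained.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
            json.dump(metadata, f)
    # Wait until the file exists everywhere before any rank tries to load it.
    if dist.is_initialized():
        dist.barrier()
```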

### Testing

Run `dist_cp_save_load.py` with `torchrun` on 2+ nodes:

- Confirm each node has its own `metadata.json` (a quick per-node check is sketched below)
- No `CheckpointingException` occurs
- The final log shows: `Loaded the disk checkpoint.`
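One way to verify the first item is a small script run on every node; the checkpoint directory path here is an assumption, so adjust it to your setup:

```python
import os
import socket

# Hypothetical per-node check: confirm the node-local checkpoint directory
# contains metadata.json, so loading will not fail on non-rank-0 nodes.
ckpt_dir = os.environ.get("CHECKPOINT_DIR", "/tmp/dist_ckpt")  # assumed path
meta_path = os.path.join(ckpt_dir, "metadata.json")
print(f"{socket.gethostname()}: metadata.json present = {os.path.exists(meta_path)}")
```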

Fixes #1530

Pranaykarvi avatar Apr 13 '25 09:04 Pranaykarvi

Thanks Pranaykarvi for the quick fix and testing!

I'd also like to learn from the Megatron team what the design assumptions around this metadata are:

  1. Is the metadata directory assumed to be hosted on a distributed filesystem (e.g. NFS)?
  2. Are users responsible for managing the metadata sync across nodes? (One possible broadcast-based sync is sketched below.)
  3. Should the application side only load the metadata from the master node rather than from all nodes?
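For illustration only, a minimal sketch of option 2, assuming `torch.distributed` is already initialized; the helper below is hypothetical and not part of Megatron-LM:

```python
import json
import os

import torch.distributed as dist


def sync_metadata_from_rank0(checkpoint_dir, metadata=None):
    """Broadcast metadata from global rank 0, then write it on each node.

    Hypothetical application-side helper; `metadata` only needs to be
    populated on global rank 0.
    """
    obj = [metadata]
    dist.broadcast_object_list(obj, src=0)  # every rank receives rank 0's copy
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
            json.dump(obj[0], f)
    dist.barrier()  # ensure the file exists before any rank tries to load it
```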

felixwqp avatar Apr 14 '25 16:04 felixwqp

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jun 13 '25 18:06 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jul 27 '25 02:07 github-actions[bot]

@sbhavani can we merge this fix? We're also running into this.

vutrung96 avatar Oct 13 '25 17:10 vutrung96

bump here! would also like this to be merged

erictang000 avatar Nov 26 '25 18:11 erictang000