
[BUGFIX] Save dist_checkpointing metadata on all nodes for multi-node training

Open Pranaykarvi opened this issue 8 months ago • 5 comments

## Description

Fixes a bug where `metadata.json` is saved only on global rank 0 during distributed checkpointing, causing load failures on other nodes in non-shared filesystem setups.

### Fix

Changed the save condition to:

```python
# Save on local rank 0 of every node, not just global rank 0
if int(os.environ.get("LOCAL_RANK", 0)) == 0:
    save_config(...)
```
This ensures `metadata.json` is saved on each node (by local rank 0), allowing checkpoints to be loaded successfully on every node.
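For context, here is a minimal sketch of the behavior change. The helper name and the barrier are illustrative assumptions (not the actual Megatron-LM code); it only assumes `torchrun` sets `LOCAL_RANK` for each process:

```python
import json
import os

import torch.distributed as dist


def save_checkpoint_metadata(checkpoint_dir, metadata):
    # Old behavior: only global rank 0 writes metadata.json, which breaks
    # loading on other nodes when checkpoints live on node-local storage.
    #   if dist.get_rank() == 0: ...
    #
    # New behavior: the first process on *each* node writes the file, so
    # every node-local checkpoint directory is self-contained.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
            json.dump(metadata, f)
    # Wait until the file exists everywhere before any rank tries to load it.
    if dist.is_initialized():
        dist.barrier()
```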

### Testing

Run `dist_cp_save_load.py` with `torchrun` on 2+ nodes:

- Confirm each node has its own `metadata.json` (a quick per-node check is sketched below)
- No `CheckpointingException` occurs
- The final log shows: `Loaded the disk checkpoint.`
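One way to verify the first item is a small script run on every node; the checkpoint directory path here is an assumption, so adjust it to your setup:

```python
import os
import socket

# Hypothetical per-node check: confirm the node-local checkpoint directory
# contains metadata.json, so loading will not fail on non-rank-0 nodes.
ckpt_dir = os.environ.get("CHECKPOINT_DIR", "/tmp/dist_ckpt")  # assumed path
meta_path = os.path.join(ckpt_dir, "metadata.json")
print(f"{socket.gethostname()}: metadata.json present = {os.path.exists(meta_path)}")
```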

Fixes #1530

Pranaykarvi avatar Apr 13 '25 09:04 Pranaykarvi

Thanks Pranaykarvi for the quick fix and testing!

I'd also like to learn from the Megatron team what the design assumptions around this metadata are:

  1. Is the metadata directory assumed to be hosted on a distributed filesystem (e.g. NFS)?
  2. Are users responsible for managing the metadata sync across nodes? (One possible broadcast-based sync is sketched below.)
  3. Should the application side only load the metadata from the master node rather than from all nodes?
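For illustration only, a minimal sketch of option 2, assuming `torch.distributed` is already initialized; the helper below is hypothetical and not part of Megatron-LM:

```python
import json
import os

import torch.distributed as dist


def sync_metadata_from_rank0(checkpoint_dir, metadata=None):
    """Broadcast metadata from global rank 0, then write it on each node.

    Hypothetical application-side helper; `metadata` only needs to be
    populated on global rank 0.
    """
    obj = [metadata]
    dist.broadcast_object_list(obj, src=0)  # every rank receives rank 0's copy
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
            json.dump(obj[0], f)
    dist.barrier()  # ensure the file exists before any rank tries to load it
```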

felixwqp avatar Apr 14 '25 16:04 felixwqp

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jun 13 '25 18:06 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jul 27 '25 02:07 github-actions[bot]

@sbhavani can we merge this fix? We're also running into this.

vutrung96 avatar Oct 13 '25 17:10 vutrung96

bump here! would also like this to be merged

erictang000 avatar Nov 26 '25 18:11 erictang000