[QUESTION] Asynchronous Checkpoint Saving
I saw that Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. @sbak5 I ran some tests on this feature and found that it helps a lot. I tried to dive into it and found that the checkpoint format has changed a lot compared to synchronous saving.
Just 3 questions:
- What is the meaning of `__0_0.distcp` and `__0_1.distcp`? There is no readme or blog about this feature. Could you please explain it?
- How to convert this format to the synchronous saving format, such as `distrib_optim.pt` and `model_optim_rng.pt`?
- How to convert this format to the HuggingFace .bin format in order to do inference?
Thanks for your help ^_^
root@inp11049767626817836924-2-1:/home/Megatron-LM/fp8_async_save_te1.7_outputs/checkpoint/8B-lr1e-4-tp1-pp4# tree -lh
.
├── [ 325] iter_0000010
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
├── [ 325] iter_0000020
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
├── [ 325] iter_0000030
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
└── [ 2] latest_checkpointed_iteration.txt
> What is the meaning of __0_0.distcp and __0_1.distcp? There is no readme or blog about this feature. Could you please explain it?

It has the format `__<global rank>_<writer process ID>.distcp`. The global rank is straightforward: it is the global rank in the default process group created by PyTorch. We create multiple writer processes so that checkpoint writing happens asynchronously; the writer process ID simply indicates which writer process the corresponding checkpoint shard came from.

> How to convert this format to the synchronous saving format? Such as distrib_optim.pt and model_optim_rng.pt.

The previous checkpoint format in Megatron-LM was converted due to the introduction of dist_checkpointing by @mikolajblaz. So synchronous checkpointing (--use-dist-ckpt only) also generates the same checkpoint format. If your question is how to convert a checkpoint in the new dist_checkpointing format back to the previous legacy format, we don't have a converter for this.

> How to convert this format to HuggingFace .bin format in order to do inference?

Megatron-Core doesn't provide a converter. A production framework based on Megatron-Core, such as NeMo, may have one.
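To make the naming concrete, here is a small hypothetical helper (for illustration only, not part of Megatron-LM) that splits a shard name into its global rank and writer process ID:

```python
import re

def parse_distcp_name(fname):
    """Split a '__<global_rank>_<writer_id>.distcp' shard name into its parts.

    Hypothetical helper for illustration; not part of Megatron-LM.
    """
    m = re.fullmatch(r"__(\d+)_(\d+)\.distcp", fname)
    if m is None:
        raise ValueError(f"not a distcp shard name: {fname}")
    return int(m.group(1)), int(m.group(2))

# In the tree above, __4_1.distcp is the shard written by global rank 4's
# second writer process:
print(parse_distcp_name("__4_1.distcp"))  # -> (4, 1)
```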
I noticed the file is named as __x_y.distcp. What do the x/y mean if TP=1/PP=4/DP=2? x will be 0/2/4/6 and y will be 0/1.
> The previous checkpoint format in Megatron-LM was converted due to the introduction of dist_checkpointing by @mikolajblaz. So, synchronous checkpointing (--use-dist-ckpt only) also generates the same checkpoint. If your question is how to convert the checkpoint in the new dist_checkpointing format to the previous legacy format, we don't have a converter for this.
It is common that we already have a checkpoint in the legacy format (such as distrib_optim.pt and model_optim_rng.pt) and want to continue training using --use-mcore-models --use-dist-ckpt --async-save. Is it possible to support this case?
> Megatron-Core doesn't provide a converter. Any production framework based on Megatron-Core may have that, such as NeMo.
Could you please provide a demo link for converting *.distcp to HF *.bin? @sbak5
> I noticed the file is named as __x_y.distcp. What do the x/y mean if TP=1/PP=4/DP=2? x will be 0/2/4/6 and y will be 0/1.
Recently, we've introduced fully parallel saving (FPS, --ckpt-fully-parallel-save) as well as async parallel saving. Without FPS, as you said, ranks [0, 2, 4, 6] will save checkpoints as before, each with 2 writer processes, which leads to [0, 1] for y.
If you turn on FPS, it leverages the duplicated model states across data-parallel replicas to parallelize checkpoint saving at the inter-node level, as described in the article below. In this case, x will be [0-8] and y will be [0-1].
Megatron-Core Tech blog article
> It is common that we already have a checkpoint in the legacy format (such as distrib_optim.pt and model_optim_rng.pt) and want to continue training using --use-mcore-models --use-dist-ckpt --async-save. Is it possible to support this case?
The load_checkpoint function basically checks whether the checkpoint at the passed path is in the legacy or dist-ckpt format. So, you can load the legacy checkpoint and save in the newer format with the option
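As a sketch, resuming from a legacy checkpoint directory and writing subsequent checkpoints in the new format could look like the command below. The paths are placeholders, and the model/data/parallelism flags (elided as `...`) must be filled in to match your run; only the flags already mentioned in this thread are shown.

```shell
# Load a legacy checkpoint (model_optim_rng.pt / distrib_optim.pt) and save
# subsequent checkpoints in the dist_checkpointing format.
# Paths below are placeholders.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --load /ckpts/legacy-format \
    --save /ckpts/dist-ckpt-format \
    --use-mcore-models \
    --use-dist-ckpt \
    --async-save \
    ...  # model, data, and parallelism flags omitted
```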
> Could you please provide a demo link for converting *.distcp to HF *.bin? @sbak5
You can easily find examples in NeMo
> The load_checkpoint basically tries to see if the checkpoint in the passed path is legacy or dist-ckpt format. So, you can load the legacy checkpoint and save in the newer format with the option
I tested, and the results are as follows. Note that:
- dist_ckpt format cannot be saved back as legacy format.
- The blog said: "Most importantly, Megatron-Core enables users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, providing the flexibility to change training configurations as needed during training." But I found that it fails due to mismatched TP/PP for DistributedOptimizer. From the codebase we can see that `TODO: add DistributedOptimizer support for differing TPxPP` is still on the TODO list. So can I assume this feature has not been implemented yet?
Please correct me if I misunderstand anything.
192.169.125.13: (TP, PP) mismatch after resume ((1, 4) vs (2, 2) from checkpoint): RNG state will be ignored
192.169.125.13: [rank3]: Traceback (most recent call last):
192.169.125.13: [rank3]: File "/home/Megatron-LM/pretrain_gpt.py", line 271, in <module>
192.169.125.13: [rank3]: pretrain(
192.169.125.13: [rank3]: File "/home/Megatron-LM/megatron/training/training.py", line 227, in pretrain
192.169.125.13: [rank3]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
192.169.125.13: [rank3]: File "/home/Megatron-LM/megatron/training/training.py", line 530, in setup_model_and_optimizer
192.169.125.13: [rank3]: args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
192.169.125.13: [rank3]: File "/home/Megatron-LM/megatron/training/checkpointing.py", line 715, in load_checkpoint
192.169.125.13: [rank3]: raise RuntimeError("{}: not supported for DistributedOptimizer".format(mismatch_msg))
192.169.125.13: [rank3]: RuntimeError: (TP, PP) mismatch after resume ((1, 4) vs (2, 2) from checkpoint): not supported for DistributedOptimizer
| Conversion | Supported or Not |
|-----------|------------------|
| legacy format -> load -> training -> save -> dist_ckpt format | √ |
| dist_ckpt format -> load -> training -> save -> legacy format | x |
| dist_ckpt format (with TP=1/PP=4/DP=2) -> load -> training -> save -> dist_ckpt format (with TP=2/PP=2/DP=2) | x |
> You can easily find examples in NeMo
From the NeMo docs, there is a demo script for converting a ckpt trained by Megatron-LM into the NeMo-compatible format, and another script for converting the NeMo format into the HuggingFace format. This link is the support matrix, and a direct conversion from MLM to the HF format is not listed there. So is there any easy way to convert a dist_ckpt from MLM into the HF format? Thanks for your kind help @sbak5
@zhaoyang-star Could you please provide a reference for the script you used for the NeMo conversion? I'm encountering this issue:
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 380, in load_from_checkpoint
    model = ptl_load_state(cls, checkpoint, strict=strict, cfg=cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/saving.py", line 144, in _load_state
    obj = cls(**_cls_kwargs)
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 243, in __init__
    super().__init__(cfg, trainer=trainer, no_lm_init=True)
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 118, in __init__
    with open_dict(cfg):
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/omegaconf/omegaconf.py", line 983, in open_dict
    prev_state = config._get_node_flag("struct")
AttributeError: 'dict' object has no attribute '_get_node_flag'
thanks
@syx11237744 This link is about converting from MLM to NeMo. But I haven't tested it yet.
> @syx11237744 This link is about converting from MLM to NeMo. But I haven't tested it yet.
Thank you! I’m using this script, but I’m encountering the above error. Do you know of any other methods to convert to the HuggingFace format besides the one mentioned above?
@sbak5 @lmcafee-nvidia It is great to see that there are conversion tools: tools/checkpoint/convert.py in the Megatron-LM repo. Are there any docs for converting from the torch_dist format into the HuggingFace format? Thanks.
dist_checkpointing was introduced by @mikolajblaz; how could I convert from the torch_dist format into the torch format?
> @sbak5 @lmcafee-nvidia It is great to see that there are convert tools: tools/checkpoint/convert.py in Megatron-LM repo. Is there any docs for converting from torch_dist format into huggingface format? Thanks. dist_checkpointing is introduced by @mikolajblaz, how could I convert from torch_dist format into torch format?
There are no converters from MLM directly to HF; you have to go through NeMo.
Converting from torch_dist to torch is a step backwards and is not recommended. However, if you need it for some reason, I believe the recent tools/checkpoint/convert.py should already support such a conversion.
Any updates about the conversion? The NeMo converter does not support the distcp format; it apparently uses the legacy format. Code
Useful: DCP to Torch