[QUESTION] Asynchronous Checkpoint Saving
I saw that Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. @sbak5 I ran some tests on this feature and found that it helps a lot. I tried to dive into it and found that the checkpoint format has changed a lot compared to synchronous saving.
Just 3 questions:
- What is the meaning of `__0_0.distcp` and `__0_1.distcp`? There is no readme or blog about this feature. Could you please explain it?
- How to convert this format to the synchronous saving format, such as `distrib_optim.pt` and `model_optim_rng.pt`?
- How to convert this format to the HuggingFace .bin format in order to do inference?
Thanks for your help ^_^
root@inp11049767626817836924-2-1:/home/Megatron-LM/fp8_async_save_te1.7_outputs/checkpoint/8B-lr1e-4-tp1-pp4# tree -lh
.
├── [ 325] iter_0000010
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
├── [ 325] iter_0000020
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
├── [ 325] iter_0000030
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
└── [ 2] latest_checkpointed_iteration.txt
> What is the meaning of __0_0.distcp and __0_1.distcp? There is no readme or blog about this feature. Could you please explain it?

It has the format `__<global rank>_<writer process ID>.distcp`. The global rank is straightforward: it is the global rank in the default process group created by PyTorch. We create multiple writer processes so that checkpoint writing happens asynchronously; the writer process ID simply indicates which writer process the corresponding checkpoint shard came from.

> How to convert this format to the synchronous saving format? Such as distrib_optim.pt and model_optim_rng.pt.

The previous checkpoint format in Megatron-LM was converted due to the introduction of dist_checkpointing by @mikolajblaz. So synchronous checkpointing (--use-dist-ckpt only) also generates the same checkpoint format. If your question is how to convert a checkpoint in the new dist_checkpointing format back to the previous legacy format, we don't have a converter for this.

> How to convert this format to HuggingFace .bin format in order to do inference?

Megatron-Core doesn't provide a converter. A production framework based on Megatron-Core, such as NeMo, may have one.
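To make the naming concrete, here is a small hypothetical helper (for illustration only, not part of Megatron-LM) that splits a shard name into its global rank and writer process ID:

```python
import re

def parse_distcp_name(fname):
    """Split a '__<global_rank>_<writer_id>.distcp' shard name into its parts.

    Hypothetical helper for illustration; not part of Megatron-LM.
    """
    m = re.fullmatch(r"__(\d+)_(\d+)\.distcp", fname)
    if m is None:
        raise ValueError(f"not a distcp shard name: {fname}")
    return int(m.group(1)), int(m.group(2))

# In the tree above, __4_1.distcp is the shard written by global rank 4's
# second writer process:
print(parse_distcp_name("__4_1.distcp"))  # -> (4, 1)
```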
I noticed the file is named as __x_y.distcp. What do the x/y mean if TP=1/PP=4/DP=2? x will be 0/2/4/6 and y will be 0/1.
> The previous checkpoint format in Megatron-LM was converted due to the introduction of dist_checkpointing by @mikolajblaz. So, synchronous checkpointing (--use-dist-ckpt only) also generates the same checkpoint. If your question is how to convert the checkpoint in the new dist_checkpointing format to the previous legacy format, we don't have a converter for this.
It is common that we already have a checkpoint in the legacy format (such as distrib_optim.pt and model_optim_rng.pt) and want to continue training using --use-mcore-models --use-dist-ckpt --async-save. Is it possible to support this case?
> Megatron-Core doesn't provide a converter. Any production framework based on Megatron-Core may have that, such as NeMo.
Could you please provide a demo link for converting *.distcp to HF *.bin? @sbak5
> I noticed the file is named as __x_y.distcp. What do the x/y mean if TP=1/PP=4/DP=2? x will be 0/2/4/6 and y will be 0/1.
Recently, we've introduced fully parallel saving (FPS, --ckpt-fully-parallel-save) as well as async parallel saving. Without FPS, as you said, ranks [0, 2, 4, 6] will save checkpoints as before, each with 2 writer processes, which leads to [0, 1] for y.
If you turn on FPS, it leverages the duplicated model states across data-parallel replicas to parallelize checkpoint saving at the inter-node level, as described in the article below. In this case, x will be [0-8] and y will be [0-1].
Megatron-Core Tech blog article
> It is common that we already have a checkpoint in the legacy format (such as distrib_optim.pt and model_optim_rng.pt) and want to continue training using --use-mcore-models --use-dist-ckpt --async-save. Is it possible to support this case?
The load_checkpoint function basically checks whether the checkpoint at the passed path is in the legacy or dist-ckpt format. So, you can load the legacy checkpoint and save in the newer format with the option
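As a sketch, resuming from a legacy checkpoint directory and writing subsequent checkpoints in the new format could look like the command below. The paths are placeholders, and the model/data/parallelism flags (elided as `...`) must be filled in to match your run; only the flags already mentioned in this thread are shown.

```shell
# Load a legacy checkpoint (model_optim_rng.pt / distrib_optim.pt) and save
# subsequent checkpoints in the dist_checkpointing format.
# Paths below are placeholders.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --load /ckpts/legacy-format \
    --save /ckpts/dist-ckpt-format \
    --use-mcore-models \
    --use-dist-ckpt \
    --async-save \
    ...  # model, data, and parallelism flags omitted
```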
> Could you please provide a demo link for converting *.distcp to HF *.bin? @sbak5
You can easily find examples in NeMo
> The load_checkpoint basically tries to see if the checkpoint in the passed path is legacy or dist-ckpt format. So, you can load the legacy checkpoint and save in the newer format with the option
I tested, and the results are as follows. Note that:
- dist_ckpt format cannot be saved back as legacy format.
- The blog said: "Most importantly, Megatron-Core enables users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, providing the flexibility to change training configurations as needed during training." But I found that it fails due to mismatched TP/PP for DistributedOptimizer. From the codebase we can see that `TODO: add DistributedOptimizer support for differing TPxPP` is still on the TODO list. So can I assume this feature has not been implemented yet?
Please correct me if I misunderstand anything.
192.169.125.13: (TP, PP) mismatch after resume ((1, 4) vs (2, 2) from checkpoint): RNG state will be ignored
192.169.125.13: [rank3]: Traceback (most recent call last):
192.169.125.13: [rank3]: File "/home/Megatron-LM/pretrain_gpt.py", line 271, in <module>
192.169.125.13: [rank3]: pretrain(
192.169.125.13: [rank3]: File "/home/Megatron-LM/megatron/training/training.py", line 227, in pretrain
192.169.125.13: [rank3]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
192.169.125.13: [rank3]: File "/home/Megatron-LM/megatron/training/training.py", line 530, in setup_model_and_optimizer
192.169.125.13: [rank3]: args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
192.169.125.13: [rank3]: File "/home/Megatron-LM/megatron/training/checkpointing.py", line 715, in load_checkpoint
192.169.125.13: [rank3]: raise RuntimeError("{}: not supported for DistributedOptimizer".format(mismatch_msg))
192.169.125.13: [rank3]: RuntimeError: (TP, PP) mismatch after resume ((1, 4) vs (2, 2) from checkpoint): not supported for DistributedOptimizer
| Conversion | Supported or Not |
|-----------|------------------|
| legacy format -> load -> training -> save -> dist_ckpt format | √ |
| dist_ckpt format -> load -> training -> save -> legacy format | x |
| dist_ckpt format (with TP=1/PP=4/DP=2) -> load -> training -> save -> dist_ckpt format (with TP=2/PP=2/DP=2) | x |
> You can easily find examples in NeMo
From the NeMo docs, there is a demo script for converting a ckpt trained by Megatron-LM into the NeMo-compatible format, and another script for converting the NeMo format into the HuggingFace format. This link is the support matrix, and a direct conversion from MLM to the HF format is not listed there. So is there any easy way to convert a dist_ckpt from MLM into the HF format? Thanks for your kind help @sbak5
@zhaoyang-star Could you please provide a reference for the script you used for the NeMo conversion? I'm encountering this issue:
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 380, in load_from_checkpoint
    model = ptl_load_state(cls, checkpoint, strict=strict, cfg=cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/saving.py", line 144, in _load_state
    obj = cls(**_cls_kwargs)
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 243, in __init__
    super().__init__(cfg, trainer=trainer, no_lm_init=True)
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 118, in __init__
    with open_dict(cfg):
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/omegaconf/omegaconf.py", line 983, in open_dict
    prev_state = config._get_node_flag("struct")
AttributeError: 'dict' object has no attribute '_get_node_flag'
thanks
@syx11237744 This link is about converting from MLM to NeMo. But I haven't tested it yet.
> @syx11237744 This link is about converting from MLM to NeMo. But I haven't tested it yet.
Thank you! I’m using this script, but I’m encountering the above error. Do you know of any other methods to convert to the HuggingFace format besides the one mentioned above?
@sbak5 @lmcafee-nvidia It is great to see that there are conversion tools: tools/checkpoint/convert.py in the Megatron-LM repo. Are there any docs for converting from the torch_dist format into the HuggingFace format? Thanks.
dist_checkpointing was introduced by @mikolajblaz; how could I convert from the torch_dist format into the torch format?
> @sbak5 @lmcafee-nvidia It is great to see that there are convert tools: tools/checkpoint/convert.py in Megatron-LM repo. Is there any docs for converting from torch_dist format into huggingface format? Thanks. dist_checkpointing is introduced by @mikolajblaz, how could I convert from torch_dist format into torch format?
There are no converters from MLM directly to HF; you have to go through NeMo.
Converting from torch_dist to torch is a step backwards and is not recommended. However, if you need it for some reason, I believe the recent tools/checkpoint/convert.py should already support such a conversion.
Any updates about the conversion? The NeMo converter does not support the distcp format; it apparently uses the legacy format. Code
Useful: DCP to Torch