
Why is model_optim_rng.pt saved in a separate directory?

zhaoyang-star opened this issue 1 year ago · 7 comments

Megatron-LM saves model_optim_rng.pt and distrib_optim.pt in a directory named mp_rank_xx_xxx. But in dlrover, distrib_optim.pt is separated out and saved in a directory named rank_xxxx.

Checkpoints work fine when they are both saved and loaded with dlrover, but loading fails when the checkpoint is saved by Megatron-LM and then loaded by dlrover. So I am curious why it is designed this way? Thanks @workingloong

zhaoyang-star (Aug 02 '24 07:08)

The flash checkpoint in DLRover saves and loads the distributed optimizer checkpoint of Megatron-LM in parallel. That is, each rank saves and loads its own shard of optimizer states to the rank_xxxx file. You can see the details at https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/blogs/megatron_flash_checkpoint.md#save-and-load-distributed-optimizer-in-parallel
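The per-rank layout can be sketched roughly as follows. This is a minimal illustration of the sharding idea, not DLRover's actual implementation: the `save_shard`/`load_shard` helpers are hypothetical, and `pickle` stands in for `torch.save`.

```python
import os
import pickle
import tempfile

def save_shard(ckpt_dir, rank, optim_shard):
    """Each rank persists only its own shard of the optimizer state,
    so all ranks can write (and later read) in parallel."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"rank_{rank:04d}.pt")
    with open(path, "wb") as f:
        pickle.dump(optim_shard, f)
    return path

def load_shard(ckpt_dir, rank):
    """A rank reads back only the shard it owns."""
    path = os.path.join(ckpt_dir, f"rank_{rank:04d}.pt")
    with open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()
    # Pretend there are 4 ranks, each owning a slice of optimizer state.
    for rank in range(4):
        save_shard(ckpt_dir, rank, {"exp_avg": [rank] * 3})
    print(sorted(os.listdir(ckpt_dir)))
    print(load_shard(ckpt_dir, 2))
```

Because no rank ever touches another rank's file, there is no serialization point, which is where the parallel speedup comes from; the tradeoff is that the on-disk layout no longer matches Megatron-LM's mp_rank_xx_xxx convention.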

workingloong (Aug 05 '24 08:08)

@workingloong Thanks for your quick reply. I got it.

I tried benchmarking dlrover and found save_to_memory costs ~55 sec. Is this normal? According to the blog, the cost of save_to_memory should be below 1 sec. Please correct me if I have misunderstood anything. Part of the logs follows:

192.169.125.62: saving checkpoint at iteration     800 to /mnt/home/flash_checkpoint_output_0802/outputs/checkpoint/16b-lr1e-4-tp1-pp4
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,237] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,238] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,249] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,250] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:35:46,251] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
192.169.125.62: [2024-08-02 13:36:35,564] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 49.314s.
192.169.125.62: [2024-08-02 13:36:36,881] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 50.645s.
192.169.125.62: [2024-08-02 13:36:37,891] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 51.654s.
192.169.125.62: [2024-08-02 13:36:38,761] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 52.525s.
192.169.125.62: [2024-08-02 13:36:40,280] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 54.031s.
192.169.125.62: [2024-08-02 13:36:42,972] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 56.722s.
192.169.125.62: [2024-08-02 13:36:55,181] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 68.931s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 107.633s.
192.169.125.62: [2024-08-02 13:37:33,870] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 107.634s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=800, global_shard_num=0)
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 107.62s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 107.621s.
192.169.125.62: [2024-08-02 13:37:33,871] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 107.634s.
192.169.125.62: (min, max) time across ranks (ms):
192.169.125.62:     save-checkpoint ................................: (107635.64, 107635.84)
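The per-rank durations can be pulled out of logs like the above with a small throwaway parser (a sketch; the regex assumes the exact "Local rank N execute PHASE in Xs." format shown):

```python
import re

log = """\
Local rank 2 execute save_to_memory in 49.314s.
Local rank 7 execute save_to_memory in 50.645s.
Local rank 5 execute save_to_memory in 107.633s.
Local rank 1 execute save_to_storage in 107.634s.
"""

# Capture rank, phase name, and elapsed seconds from each line.
pattern = re.compile(r"Local rank (\d+) execute (\w+) in ([\d.]+)s\.")
times = {}
for rank, phase, secs in pattern.findall(log):
    times.setdefault(phase, {})[int(rank)] = float(secs)

for phase, per_rank in times.items():
    print(f"{phase}: min={min(per_rank.values()):.3f}s "
          f"max={max(per_rank.values()):.3f}s")
```

Applied to the full log above, this makes the anomaly easy to spot: most ranks finish save_to_memory in ~50 s, but rank 5 takes ~108 s, and save_to_storage on every rank is gated by that straggler.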

zhaoyang-star (Aug 05 '24 08:08)

Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared dlrover with v0.7.0?

zhaoyang-star (Aug 05 '24 09:08)

I tried benchmarking dlrover and found save_to_memory costs ~55 sec. Is this normal?

Did you use distributed_optimizer and the following APIs?

from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint

workingloong (Aug 05 '24 12:08)

Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared dlrover with v0.7.0?

Not yet.

workingloong (Aug 05 '24 12:08)

Did you use distributed_optimizer and the following APIs?

Yes, both are used. It is weird that when training a 16B model, saving to memory costs about 50 sec. BTW, the memory saving time is also about 50 sec when using Megatron-LM's async save. Maybe the disk bandwidth in my environment is low.

zhaoyang-star (Aug 06 '24 02:08)

BTW, the memory saving time is also about 50 sec when using Megatron-LM's async save. Maybe the bandwidth of my env's disk is low.

Yeah, disk performance may affect the time to save the checkpoint into memory, because the async checkpoint uses shared memory, which needs to create a file on the disk. I conducted some experiments and found that the performance of saving the checkpoint into memory with an SSD is much better than with NAS.
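The "shared memory is backed by a file" point can be observed directly. On Linux, a POSIX shared-memory segment created via Python's stdlib appears as a file under /dev/shm, which is why the filesystem is involved at all (a quick illustration, assuming a Linux host):

```python
import os
from multiprocessing import shared_memory

# Create a 1 MiB shared-memory segment. On Linux it is exposed as a
# file under /dev/shm, which is the kind of backing file an async
# checkpoint writer copies the state dict into.
shm = shared_memory.SharedMemory(create=True, size=1024 * 1024)
backing_file = f"/dev/shm/{shm.name}"
print(backing_file, os.path.exists(backing_file))

shm.close()
shm.unlink()  # removes the backing file
```

Checking where the segment's backing file actually lives (tmpfs vs. a mount on slow storage), and where the final checkpoint directory lives, is a quick way to tell which of the two phases the disk is throttling.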

workingloong (Aug 06 '24 12:08)

This issue has been automatically marked as stale because it has not had recent activity.

github-actions[bot] (Nov 05 '24 01:11)

This issue is being automatically closed due to inactivity.

github-actions[bot] (Nov 13 '24 01:11)