add checkpoint

Open zhangsmallshark opened this issue 11 months ago • 11 comments

Add checkpoint support for Domino.

zhangsmallshark avatar Dec 16 '24 19:12 zhangsmallshark

Looks good to me.

cc @tjruwase

GuanhuaWang avatar Jan 22 '25 23:01 GuanhuaWang

Works as expected. I have one question I'd like to confirm. Do we need to save the state of the data loader to avoid reusing data samples?

@hwchen2017, this is a good question. If this is standard in PyTorch or Megatron, we should keep it; otherwise we can skip it.

GuanhuaWang avatar Jan 28 '25 22:01 GuanhuaWang

Works as expected. I have one question I'd like to confirm. Do we need to save the state of the data loader to avoid reusing data samples?

This is one reason why `args` is saved as part of the checkpoint. I recommend following the pattern in Megatron-DeepSpeed: https://github.com/microsoft/Megatron-DeepSpeed/blob/f4157bea69f3df8c6cb66f2ebcda66ba03d1288e/megatron/checkpointing.py#L602-L611
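For illustration only, here is a minimal sketch of the idea (not the code at the linked lines, and not what this PR implements): the training args are stored inside the checkpoint, and on load the number of already-consumed samples is copied back so the data loader can skip them. The helper names `save_checkpoint`, `load_checkpoint`, and `remaining_sample_indices` are hypothetical, `pickle` stands in for whatever serializer the real code uses, and the field name `consumed_train_samples` is an assumption modeled on Megatron-style naming.

```python
import argparse
import os
import pickle
import tempfile


def save_checkpoint(path, iteration, model_state, args):
    """Hypothetical helper: store args next to the model state.

    Keeping args in the checkpoint preserves data progress
    (here: args.consumed_train_samples) across restarts.
    """
    state = {"iteration": iteration, "model": model_state, "args": args}
    with open(path, "wb") as f:
        pickle.dump(state, f)  # stand-in for the real serializer


def load_checkpoint(path, args):
    """Hypothetical helper: restore state and copy data progress into args."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    checkpoint_args = state["args"]
    # Fall back to 0 if the checkpoint predates this field (assumed name).
    args.consumed_train_samples = getattr(
        checkpoint_args, "consumed_train_samples", 0
    )
    return state["iteration"], state["model"]


def remaining_sample_indices(num_samples, consumed):
    """Indices still unseen in a single sequential pass over the dataset."""
    return list(range(consumed, num_samples))


# Simulate a run that consumed 6 samples before checkpointing.
args = argparse.Namespace(consumed_train_samples=0)
args.consumed_train_samples += 6

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = os.path.join(tmp, "ckpt.pkl")
    save_checkpoint(ckpt_path, iteration=3, model_state={"w": 1.0}, args=args)

    # A fresh process starts with default args and restores from disk.
    fresh_args = argparse.Namespace(consumed_train_samples=0)
    iteration, model_state = load_checkpoint(ckpt_path, fresh_args)

# The resumed run skips the samples that were already used.
resume_indices = remaining_sample_indices(10, fresh_args.consumed_train_samples)
```

With this shape, no separate data loader state file is needed: a sampler that accepts a starting offset can resume from `fresh_args.consumed_train_samples`.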

tjruwase avatar Jan 29 '25 14:01 tjruwase

@zhangsmallshark please address above comments. Thanks!

GuanhuaWang avatar Feb 04 '25 23:02 GuanhuaWang

@GuanhuaWang I fixed it. Please check it.

zhangsmallshark avatar Feb 10 '25 16:02 zhangsmallshark

@zhangsmallshark - could you sign off with DCO on this PR? It replaces the CLA we had before. To fix it, the steps should be here

loadams avatar Feb 10 '25 19:02 loadams

@zhangsmallshark - could you sign off with DCO on this PR? It replaces the CLA we had before. To fix it, the steps should be here

I am working on it.

zhangsmallshark avatar Feb 12 '25 14:02 zhangsmallshark

I think I fixed it. I have tried:

git rebase HEAD~10 --signoff
git push --force-with-lease origin master

zhangsmallshark avatar Feb 12 '25 15:02 zhangsmallshark

I think I fixed it. I have tried:

git rebase HEAD~10 --signoff
git push --force-with-lease origin master

It looks like you'd need to merge the DeepSpeedExamples master branch back in now. If you can't get it to work, we can override it too if you need to revert your most recent push.

loadams avatar Feb 12 '25 16:02 loadams

You can override it. Thanks.

zhangsmallshark avatar Feb 12 '25 17:02 zhangsmallshark

@zhangsmallshark, please resolve this branch conflict as we discussed. Thanks.

GuanhuaWang avatar Feb 12 '25 23:02 GuanhuaWang