DeepSpeedExamples
add checkpoint
support checkpoint for domino
Looks good to me.
cc @tjruwase
Works as expected. I have one question I'd like to confirm: do we need to save the state of the data loader to avoid reusing data samples after resuming?
@hwchen2017, this is a good question. If this is standard in PyTorch or Megatron, we should keep it; otherwise we can skip it.
This is one reason why args is saved as part of the checkpoint. I recommend following the pattern in Megatron-DeepSpeed:
https://github.com/microsoft/Megatron-DeepSpeed/blob/f4157bea69f3df8c6cb66f2ebcda66ba03d1288e/megatron/checkpointing.py#L602-L611
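For readers following along, here is a minimal sketch of the idea being discussed: keep a consumed-sample counter on args, save args with the checkpoint, and restore the counter on load so the data loader can skip what it has already seen. This is not the Domino or Megatron-DeepSpeed code. The names `save_checkpoint`, `load_checkpoint`, `resume_batches`, and the use of `pickle` are assumptions for illustration; `consumed_train_samples` is assumed to be the counter attribute, modeled on the linked pattern.

```python
import pickle
from argparse import Namespace


def save_checkpoint(path, iteration, model_state, args):
    """Store args next to the model state so data progress survives a restart."""
    state = {"iteration": iteration, "model": model_state, "args": args}
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path, args):
    """Restore the consumed-sample counter from the args saved in the checkpoint."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    checkpoint_args = state["args"]
    # Fall back to 0 for checkpoints written before the counter existed.
    args.consumed_train_samples = getattr(
        checkpoint_args, "consumed_train_samples", 0
    )
    return state["iteration"], state["model"]


def resume_batches(num_samples, batch_size, consumed_samples):
    """Yield index batches, skipping samples already consumed (hypothetical sampler)."""
    for start in range(consumed_samples, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))
```

The training loop would increment `args.consumed_train_samples` by the batch size after each step, so the value written at save time tells the sampler where to start after a restart.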
@zhangsmallshark please address above comments. Thanks!
@GuanhuaWang I fixed it. Please check it.
@zhangsmallshark - could you sign off with DCO on this PR? It replaces the CLA we had before. To fix it, the steps should be here
I am working on it.
I think I fixed it. I have tried:
git rebase HEAD~10 --signoff
git push --force-with-lease origin master
It looks like you'd need to merge the DeepSpeedExamples master branch back in now. If you can't get it to work, we can override it too if you need to revert your most recent push.
You can override it. Thanks.
@zhangsmallshark, please resolve this branch conflict as we discussed. Thanks.