DeepSpeed
add pod level retry relaunch with worker number unchanged
Add relaunch logic for "pod level retry". In this case the number of workers stays the same and only the IPs of the newly started workers change, so the script needs to be re-launched.
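To illustrate the condition described above, here is a minimal sketch of how such a check might look. It is not the code from this patch and not part of the DeepSpeed API: the function names `membership_changed` and `is_pod_level_retry` and the plain lists of IP strings are hypothetical, chosen only to show why comparing worker counts alone would miss this case.

```python
from typing import Sequence


def membership_changed(previous_ips: Sequence[str], current_ips: Sequence[str]) -> bool:
    """Return True when the worker addresses differ, ignoring order.

    Hypothetical helper for illustration only; not a DeepSpeed function.
    """
    return sorted(previous_ips) != sorted(current_ips)


def is_pod_level_retry(previous_ips: Sequence[str], current_ips: Sequence[str]) -> bool:
    """Return True for the pod level retry case: same worker count, different IPs.

    A check based only on len() would report "no change" here, so the
    script would not be re-launched even though some workers have new IPs.
    """
    same_count = len(previous_ips) == len(current_ips)
    return same_count and membership_changed(previous_ips, current_ips)


if __name__ == "__main__":
    before = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    # One pod was restarted and came back with a new IP.
    after = ["10.0.0.1", "10.0.0.2", "10.0.0.7"]
    if is_pod_level_retry(before, after):
        print("worker count unchanged but IPs differ: re-launch needed")
```

Sorting before comparing keeps the check independent of the order in which workers are listed, so a simple reordering of the same hosts would not trigger a re-launch.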
Can one of the admins verify this patch?
@shadowtudark - apologies for not reviewing this before, but is this still a PR you think brings value? If so, happy to review.
Closing as stale for now; the target branch is also not master. If this is needed, please re-open or re-create this PR.