
add pod level retry relaunch with worker number unchanged

Open shadowtudark opened this issue 4 years ago • 1 comments

Add relaunch logic for "pod-level retry". In this case the number of workers stays the same and only the IPs of the newly started workers change, so the script needs to be re-launched.
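As a rough illustration of the condition described above (not the code in this PR), the check could look something like the sketch below. The names `should_relaunch` and `relaunch_if_needed` and the `launch` callback are hypothetical and are not part of the DeepSpeed API; the sketch assumes the launcher can obtain the list of worker IPs before and after a pod restart.

```python
from typing import Callable, List, Sequence


def should_relaunch(previous_ips: Sequence[str], current_ips: Sequence[str]) -> bool:
    """Return True for a pod-level retry: same worker count, different IPs.

    Hypothetical helper, not a DeepSpeed function.
    """
    same_count = len(previous_ips) == len(current_ips)
    # Compare sorted copies so a mere reordering of workers is not a change.
    ips_changed = sorted(previous_ips) != sorted(current_ips)
    return same_count and ips_changed


def relaunch_if_needed(
    previous_ips: Sequence[str],
    current_ips: Sequence[str],
    launch: Callable[[List[str]], None],
) -> bool:
    """Call `launch` with the new IP list when a pod-level retry is detected.

    `launch` is an assumed callback that restarts the training script
    against the given workers. Returns True if a relaunch was triggered.
    """
    if should_relaunch(previous_ips, current_ips):
        launch(list(current_ips))
        return True
    return False
```

A change in the number of workers is deliberately not handled here, since that is a scaling event rather than the pod-level retry this PR targets.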

shadowtudark avatar Aug 07 '21 01:08 shadowtudark

Can one of the admins verify this patch?

rocm-mici avatar Jun 09 '22 20:06 rocm-mici

@shadowtudark - apologies for not reviewing this before, but is this still a PR you think brings value? If so, happy to review.

loadams avatar Aug 18 '23 20:08 loadams

Closing as stale for now; the target branch is also not master. If this is needed, please re-open or re-create this PR.

loadams avatar Aug 23 '23 15:08 loadams