volcano
volcano support checkpoint
What would you like to be added:
I hope Volcano can support checkpointing, so that when a job is interrupted for external reasons, it can resume from the last checkpoint instead of starting over.
Why is this needed:
Some lengthy calculations can run for days. If the compute node dies, we have to restart from the beginning. With a checkpoint in place, we can at least restart from that point in the calculation.
May I know which computing engine (e.g. Spark) you are using? Usually the checkpoint function is implemented in the upper-layer framework, e.g. PyTorch.
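To illustrate what checkpointing in the upper layer can look like, here is a minimal sketch in plain Python. It is not Volcano or PyTorch code; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the `fail_at` parameter used to simulate a node failure are all hypothetical and exist only for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write never leaves a corrupt file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Replace the old checkpoint in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(path, num_steps, checkpoint_every=10, fail_at=None):
    """Hypothetical long computation: sums 0..num_steps-1, resuming from path."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], num_steps):
        if fail_at is not None and step == fail_at:
            # Stand-in for the node dying partway through the job.
            raise RuntimeError("simulated node failure")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first call to `run` fails partway through, calling it again with the same path picks up from the last saved step rather than from step 0. On a cluster this only helps if the checkpoint path lives on storage that survives the pod, for example a mounted persistent volume (an assumption about the deployment, not something stated in this issue), and if the scheduler relaunches the job after the failure.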
Hello 👋 Looks like there was no activity on this issue for the last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there is no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for the last 60 days after being marked as stale. Let us know if you need this to be reopened! 🤗