oumi
[Feature] Add support for Managed Jobs in Sky Clusters
Feature request
Managed jobs run on spot instances and are resilient to preemption.
Pros:
- The job will restart automatically until it's completed
Cons:
- You cannot SSH into the machine running your job
- You need to manually add checkpointing to your job
- Jobs may move across regions due to hardware availability
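To illustrate the checkpointing point above, here is a minimal sketch of a resumable loop. It is not Oumi's or SkyPilot's API: `load_checkpoint`, `save_checkpoint`, `train`, and the `stop_at` parameter (used only to simulate a preemption) are hypothetical names, and the JSON file stands in for a real model checkpoint. In a real managed job the checkpoint path would presumably need to live on storage that survives the instance, since the job may restart on a different machine.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a preemption mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_steps, save_every=10, stop_at=None):
    """Resume from the last checkpoint and run until total_steps.

    stop_at is a hypothetical knob that simulates a preemption by
    returning early without saving.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state  # simulated preemption: unsaved progress is lost
        state["step"] += 1  # placeholder for one real training step
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state
```

Calling `train` again with the same path after an interruption picks up from the last saved step instead of starting from zero, which is the behavior a restarted managed job would rely on.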
Motivation / references
For long training jobs, we need to ensure that jobs on the cloud are resilient to preemption. SkyPilot supports this.
Reference: https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html
Your contribution
This change requires:
Hi @taenin, I want to work on this.
Hi @cardit1, apologies for not responding earlier! We were pretty busy post-launch and this slipped through the cracks. We really appreciate you volunteering to help with this! Let me know if you're still interested, and I can assign the issue to you. Feel free to ask if you have any questions as well.
Hey, if no one is working on this, I would like to contribute.
Thank you, Abhinav! I've assigned it to you for now, since there was no previous activity.
Could you help me a bit with the setup? Even the unedited version ends the job with a failed status. I am using Linux.
@IRONmanAbhi, could you please elaborate? What are the steps to reproduce the error, and what is the error message? Running oumi env is also useful, since it displays your local package versions.