[Feature] Add support for Managed Jobs in Sky Clusters

Open taenin opened this issue 10 months ago • 6 comments

Feature request

Managed jobs are jobs, typically run on spot instances, that recover automatically from preemption.

Pros:

  • The job will restart automatically until it's completed

Cons:

  • You cannot SSH into the machine running your job
  • You need to manually add checkpointing to your job
  • Jobs may move across regions due to hardware availability
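To illustrate the checkpointing point above, here is a minimal sketch of a resumable loop. It is not taken from the oumi codebase: the function names, the JSON checkpoint format, and the `stop_after` parameter (which only simulates a preemption) are all assumptions made for illustration. In a real managed job the checkpoint path would presumably need to live on storage that survives the instance, such as a mounted bucket.

```python
import json
import os
import tempfile
from typing import Optional


def load_step(path: str) -> int:
    """Return the last completed step, or 0 when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path: str, step: int) -> None:
    """Write to a temp file first, then rename it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path: str, total_steps: int, stop_after: Optional[int] = None) -> int:
    """Resume from the last checkpoint and run until done.

    `stop_after` is a hypothetical knob that simulates a preemption
    after that many steps in the current run.
    """
    step = load_step(path)
    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated preemption
        # ... one unit of real training work would go here ...
        step += 1
        done_this_run += 1
        save_step(path, step)
    return step
```

When the job is restarted after a preemption, calling `train` again with the same path picks up from the saved step instead of starting over.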

Motivation / references

For long training jobs, we need to ensure that jobs on the cloud are resilient to preemption. SkyPilot supports this through managed jobs.

Reference: https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html

Your contribution

This change requires:

  • Updating our JobConfig to let users specify this setting
  • Updating our SkyClient to import sky.jobs and call sky.jobs.launch when using managed jobs.
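The two changes above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `JobConfig` here is a trimmed-down stand-in, not the project's real class, and the `use_managed_job` field name and the `launch` helper are hypothetical. The launchers are passed in as callables so the sketch runs without SkyPilot installed; in the real client they would presumably wrap `sky.launch` and `sky.jobs.launch`.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class JobConfig:
    """Hypothetical, trimmed-down stand-in for the project's JobConfig."""

    name: str
    run: str
    use_managed_job: bool = False  # assumed field name, not an existing option


def launch(
    config: JobConfig,
    cluster_launch: Callable[[JobConfig], Any],
    managed_launch: Callable[[JobConfig], Any],
) -> Any:
    """Route the job to the managed-jobs launcher when the flag is set.

    In a real SkyClient, `cluster_launch` would wrap sky.launch and
    `managed_launch` would wrap sky.jobs.launch (assumption: the exact
    arguments each one needs are left to the implementation).
    """
    launcher = managed_launch if config.use_managed_job else cluster_launch
    return launcher(config)
```

Keeping the default at `False` means existing job configs would behave exactly as before, and only users who opt in get the managed-job path.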

taenin avatar Feb 04 '25 22:02 taenin

Hi @taenin I want to work on this

cardit1 avatar Feb 05 '25 07:02 cardit1

Hi @cardit1, apologies for not responding earlier! We were pretty busy post-launch and this slipped through the cracks. We really appreciate you volunteering to help with this! Let me know if you're still interested, and I can assign the issue to you. Feel free to ask if you have any questions as well.

wizeng23 avatar Feb 19 '25 22:02 wizeng23

Hey, if no one is working on this, I would like to contribute.

IRONmanAbhi avatar Mar 07 '25 10:03 IRONmanAbhi

Thank you, Abhinav! Assigned it to you for now since there was no previous activity.

nikg4 avatar Mar 07 '25 17:03 nikg4

Can you help me a bit with the setup? Even with the unedited version, the job ends with a failed status. I am using Linux.

IRONmanAbhi avatar Mar 10 '25 10:03 IRONmanAbhi

@IRONmanAbhi, could you please elaborate? What are the steps to reproduce the error, and what is the error message? Running oumi env is also useful, since it displays local package versions.

wizeng23 avatar Apr 08 '25 06:04 wizeng23