4 comments by PeterChg

/assign @terrytangyuan

> I am not sure if this is a common use case. Could you elaborate? The ability to suspend and resume Jobs is often desired when cluster resources are limited...

> What are the changes you are trying to make to training operator? Add some logic to the PyTorch job lifecycle: delete pods when the job is suspended, create pods when the job is resumed...

> Can you add more info and update description? We love to add support for frameworks like DeepSpeed and LLM examples. What are your thoughts? With the open source of...