skypilot
[Core] New JobGroup concept for launching a set of jobs with a single command
We mainly want to be able to specify something like a "job group" in a single YAML file and launch/stop it with a single command line. Each job in the group can have its own number of nodes, resource requirements, and entry point command. And then we'd need a way to connect the jobs within a group (e.g. pass in networking addresses of one job as a flag into another job). https://github.com/skypilot-org/skypilot/discussions/8292#discussioncomment-15244515
As requested in the discussion linked above, it could be useful to have a JobGroup concept that allows multiple jobs defined in a single YAML to be launched together with a single command. It would enable use cases like heterogeneous RL training.
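
To make the request more concrete, here is a minimal Python sketch of the data model it implies. It is not SkyPilot's API: the `Job` and `JobGroup` classes, the `resolve_commands` helper, and the `{<job>.address}` placeholder syntax are all hypothetical names invented for illustration, and the addresses are made up. It only shows one possible way a group could hold per-job node counts, resources, and entry points, and how one job's address could be passed as a flag to another.

```python
import re
from dataclasses import dataclass


@dataclass
class Job:
    """One member of a group (hypothetical model, not a SkyPilot class)."""
    name: str
    num_nodes: int
    resources: dict
    run: str  # entry point command; may reference peers as {<job>.address}


@dataclass
class JobGroup:
    """A set of jobs meant to be launched and stopped together (hypothetical)."""
    name: str
    jobs: list


# Assumed placeholder syntax for "the address of another job in the group".
_PLACEHOLDER = re.compile(r"\{(\w+)\.address\}")


def resolve_commands(group, addresses):
    """Return {job name: command} with peer-address placeholders filled in.

    `addresses` maps job names to address strings. In a real system these
    would only be known once the jobs have been provisioned.
    """
    known = {job.name for job in group.jobs}

    def substitute(match):
        peer = match.group(1)
        if peer not in known:
            raise KeyError(f"unknown job {peer!r} in group {group.name!r}")
        return addresses[peer]

    return {job.name: _PLACEHOLDER.sub(substitute, job.run) for job in group.jobs}


# Example: a heterogeneous RL setup with GPU trainers and CPU rollout workers.
group = JobGroup(
    name="rl-training",
    jobs=[
        Job("trainer", num_nodes=4, resources={"accelerators": "A100:8"},
            run="python train.py"),
        Job("rollout", num_nodes=2, resources={"cpus": 32},
            run="python rollout.py --trainer-addr {trainer.address}"),
    ],
)
commands = resolve_commands(
    group, {"trainer": "10.0.0.1:6000", "rollout": "10.0.0.2:6000"}
)
print(commands["rollout"])
```

In this sketch the placeholder is resolved only after addresses are known, which mirrors the ordering problem a real implementation would face: the group has to be provisioned as a unit before any job's entry point that depends on a peer can start.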