Added a Kubernetes Kueue batch job scheduler based on the Kubernetes scheduler. Note: the variable local_kueue="local-kueue-name" is required in the scheduler args, where it sets the queue-name label. To set a priority, add kueue_priority_class="kueue-priority-class-name" to the scheduler args.

Manual Testing

Set up a Kubernetes/OpenShift cluster

Install Kueue
- Create the necessary Kueue resources: a ClusterQueue, a ResourceFlavor, a LocalQueue, and WorkloadPriorityClasses for testing job priority
Clone this branch locally and create a torchx whl file with python3 setup.py bdist_wheel
Install the whl file with pip install dist/torchx-0.7.0.dev0-py3-none-any.whl
Authenticate to your K8s cluster
Run a sample job with torchx run --scheduler kueue_job --scheduler_args namespace=default,local_kueue="default-kueue",image_repo="user/alpine" utils.echo --image alpine:latest --msg hello - should return something like kueue_job://torchx_user/1234
Get the job's status with torchx status kueue_job://torchx_user/1234
All jobs created with this scheduler should gain the Suspended/JobResumed status
Users can add custom annotations in the scheduler args following this scheme "annotations": {"key":"value"}
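As a rough illustration of how these scheduler args could end up on the Job's metadata, here is a minimal sketch. The two label keys are the ones Kueue documents for queue selection and workload priority. The helper name build_job_metadata and the exact mapping from scheduler args to labels are assumptions for illustration, not code taken from this PR.

```python
from typing import Any, Dict, Optional

# Label keys documented by Kueue. How this PR's scheduler applies them
# is an assumption made for this sketch.
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"
PRIORITY_CLASS_LABEL = "kueue.x-k8s.io/priority-class"


def build_job_metadata(
    name: str,
    local_kueue: str,
    kueue_priority_class: Optional[str] = None,
    annotations: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: turn scheduler args into Job metadata."""
    if not local_kueue:
        # local_kueue is described as required in the scheduler args.
        raise ValueError("local_kueue is required")
    labels = {QUEUE_NAME_LABEL: local_kueue}
    if kueue_priority_class:
        # The priority label is only added when a class name is given.
        labels[PRIORITY_CLASS_LABEL] = kueue_priority_class
    return {
        "name": name,
        "labels": labels,
        # Custom annotations are passed through unchanged.
        "annotations": dict(annotations or {}),
    }


metadata = build_job_metadata(
    "echo-1234",
    local_kueue="default-kueue",
    annotations={"key": "value"},
)
```

Kueue admits a Job by flipping its spec.suspend field, which is why jobs submitted this way are expected to pass through the Suspended state before being resumed.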

Integration test

Start the minikube setup script with bash setup_minikube_kueue.sh
Run the test file with python scripts/kueue_test.py --container_repo localhost:5000/torchx
For a dry run, run python scripts/kueue_test.py --container_repo localhost:5000/torchx --dryrun

Test plan:

Created unit tests based on the Kubernetes scheduler unit tests.

Feb 14 '24 10:02 Bobbins228

Hi @Bobbins228!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Feb 14 '24 10:02 facebook-github-bot

What does kueue do that volcano doesn't? Is there a reason to support more than one batch scheduler on kubernetes?

Feb 17 '24 09:02 ccharest93

Hi @ccharest93, we decided to add a Kueue batch job scheduler because Kueue is backed by the Kubernetes Batch SIG, and this gives users more options when choosing a Kubernetes scheduler.

Feb 27 '24 09:02 Bobbins228

Looks like there are also valid pyre/lint issues

Mar 26 '24 19:03 d4l3k

@Bobbins228 there's been a bunch of cleanup on tests on the main branch -- if you rebase/merge, all tests should be passing (assuming no new breaking changes)

Mar 27 '24 23:03 d4l3k

@Bobbins228 did you try multi-node ddp with it? I think you'd need a JobSet for that which sets up a headless k8s service for workers to reach the master. See example here https://github.com/kubernetes-sigs/jobset/blob/main/examples/pytorch/resnet-cifar10/resnet.yaml

Another way to approach this is perhaps to not be tightly coupled with a specific operator but rather just output a standard k8s Job (or even Pod) spec with the container image and allow it to be piped into other tools that can transform it into whatever is suitable. But AFAICT something like a --output-format yaml isn't supported in torchx, and --dryrun aims to provide human-readable rather than tool-parsable output, so I don't know if this is realistic?

Mar 28 '24 01:03 xujyan

@xujyan @Bobbins228 looks like the ddp job on Kueue is having issues

Mar 28 '24 22:03 d4l3k

I would expect that. Like I suggested above you'd need to produce a JobSet instead of a Job, and as the example here shows, point to the master at

              - name: MASTER_ADDR
                value: "pytorch-workers-0-0.pytorch"

where the value is of the format <jobset_name>-<job_name>-0-0.<jobset_name>
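A minimal sketch of that naming scheme in Python, in case it helps when generating the env var. The function name jobset_master_addr is hypothetical, and the two zeros are hardcoded exactly as they appear in the format above.

```python
def jobset_master_addr(jobset_name: str, job_name: str) -> str:
    """Build MASTER_ADDR as <jobset_name>-<job_name>-0-0.<jobset_name>."""
    return f"{jobset_name}-{job_name}-0-0.{jobset_name}"


# Reproduces the value from the linked example.
print(jobset_master_addr("pytorch", "workers"))  # pytorch-workers-0-0.pytorch
```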

Mar 29 '24 00:03 xujyan

And you'd need to install jobset controller in the test env too, see README

Mar 29 '24 00:03 xujyan
