
Added Kueue Job Scheduler

Open Bobbins228 opened this issue 1 year ago • 9 comments

Added a Kubernetes Kueue batch job scheduler based on the Kubernetes scheduler. Note: local_kueue="local-kueue-name" is required in the scheduler args and sets the queue-name label. To set a priority, add kueue_priority_class="kueue-priority-class-name" to the scheduler args.
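
As a rough illustration of what these two scheduler args imply, the sketch below builds a plain Kubernetes Job manifest as a Python dict and attaches the Kueue labels. The label keys kueue.x-k8s.io/queue-name and kueue.x-k8s.io/priority-class are the ones Kueue documents for Jobs; the helper name build_kueue_job, the manifest layout, and the choice to create the Job suspended are assumptions made for illustration and are not taken from the code in this PR.

```python
from typing import Any, Dict, Optional

# Label keys that Kueue reads from a Job's metadata.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"
PRIORITY_LABEL = "kueue.x-k8s.io/priority-class"


def build_kueue_job(
    name: str,
    image: str,
    local_kueue: str,
    kueue_priority_class: Optional[str] = None,
    annotations: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: return a batch/v1 Job manifest labelled for Kueue."""
    if not local_kueue:
        # The PR description states that local_kueue is required.
        raise ValueError("local_kueue is required for the queue-name label")

    labels = {QUEUE_LABEL: local_kueue}
    if kueue_priority_class:
        labels[PRIORITY_LABEL] = kueue_priority_class

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "labels": labels,
            # Copy so the caller's dict is never mutated.
            "annotations": dict(annotations or {}),
        },
        "spec": {
            # Assumption: the Job starts suspended and Kueue resumes it on admission.
            "suspend": True,
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }
```

Calling build_kueue_job("echo", "alpine:latest", "default-kueue") yields a manifest with only the queue-name label; passing kueue_priority_class adds the priority label as well.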

Manual Testing

Set up a Kubernetes/OpenShift cluster

  • Install Kueue
  • Clone this branch locally and create a torchx whl file with python3 setup.py bdist_wheel
  • Install the whl file with pip install dist/torchx-0.7.0.dev0-py3-none-any.whl
  • Authenticate to your K8s cluster
  • Run a sample job with torchx run --scheduler kueue_job --scheduler_args namespace=default,local_kueue="default-kueue",image_repo="user/alpine" utils.echo --image alpine:latest --msg hello - should return something like kueue_job://torchx_user/1234
  • Get the job's status with torchx status kueue_job://torchx_user/1234
  • All jobs created with this scheduler should gain the Suspended/JobResumed status
  • Users can add custom annotations in the scheduler args following this scheme "annotations": {"key":"value"}
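
The handle returned by torchx run in the steps above has the shape scheduler://session/app_id. A minimal, standard-library-only sketch of splitting such a handle follows; parse_handle and AppHandle are hypothetical names used here for illustration. torchx has its own handle parser, which this sketch does not replace. Plain string partitioning is used instead of urllib.parse because the scheme kueue_job contains an underscore, which urllib does not accept as a scheme character.

```python
from typing import NamedTuple


class AppHandle(NamedTuple):
    scheduler: str
    session: str
    app_id: str


def parse_handle(handle: str) -> AppHandle:
    """Split 'scheduler://session/app_id' into its three parts (illustrative)."""
    scheduler, sep, rest = handle.partition("://")
    session, slash, app_id = rest.partition("/")
    # Assumption: the session may be empty, but scheduler and app_id may not.
    if not (sep and slash and scheduler and app_id):
        raise ValueError(f"not a valid app handle: {handle!r}")
    return AppHandle(scheduler, session, app_id)
```

For the example handle used above, parse_handle("kueue_job://torchx_user/1234") returns AppHandle(scheduler='kueue_job', session='torchx_user', app_id='1234').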

Integration test

  • Start the minikube setup script with bash setup_minikube_kueue.sh
  • Run the test file with python scripts/kueue_test.py --container_repo localhost:5000/torchx
  • For a dry run, run python scripts/kueue_test.py --container_repo localhost:5000/torchx --dryrun

Test plan:

Created unit tests based on the Kubernetes scheduler's unit tests.

Bobbins228 avatar Feb 14 '24 10:02 Bobbins228

Hi @Bobbins228!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Feb 14 '24 10:02 facebook-github-bot

What does kueue do that volcano doesn't? Is there a reason to support more than one batch scheduler on kubernetes?

ccharest93 avatar Feb 17 '24 09:02 ccharest93

Hi @ccharest93, We have decided to add a Kueue Batch Job scheduler as Kueue is backed by Kubernetes Batch Sig and this gives users more options in terms of choosing a Kubernetes Scheduler.

Bobbins228 avatar Feb 27 '24 09:02 Bobbins228

Looks like there are also valid pyre/lint issues

d4l3k avatar Mar 26 '24 19:03 d4l3k

@Bobbins228 there's been a bunch of cleanup on tests on the main branch -- if you rebase/merge, all tests should be passing (assuming no new breaking changes)

d4l3k avatar Mar 27 '24 23:03 d4l3k

@Bobbins228 did you try multi-node ddp with it? I think you'd need a JobSet for that which sets up a headless k8s service for workers to reach the master. See example here https://github.com/kubernetes-sigs/jobset/blob/main/examples/pytorch/resnet-cifar10/resnet.yaml

Another way to approach this is perhaps to not be tightly coupled with a specific operator but rather just output a standard k8s Job (or even Pod) spec with the container image and allow it to be piped into other tools that can transform it into whatever is suitable. But AFAICT something like a --output-format yaml isn't supported in torchx, and --dryrun aims to provide human-readable rather than tool-parsable output, so I don't know if this is realistic?

xujyan avatar Mar 28 '24 01:03 xujyan

@xujyan @Bobbins228 looks like the ddp job on Kueue is having issues

d4l3k avatar Mar 28 '24 22:03 d4l3k

I would expect that. Like I suggested above you'd need to produce a JobSet instead of a Job, and as the example here shows, point to the master at

              - name: MASTER_ADDR
                value: "pytorch-workers-0-0.pytorch"

where the value is of the format <jobset_name>-<job_name>-0-0.<jobset_name>
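
A small sketch of that naming rule, assuming the format stated above holds for the JobSet in question. The function names master_addr and master_addr_env are hypothetical and only illustrate how the env entry could be generated.

```python
from typing import Dict


def master_addr(jobset_name: str, job_name: str) -> str:
    """Build '<jobset_name>-<job_name>-0-0.<jobset_name>' for the first worker pod."""
    return f"{jobset_name}-{job_name}-0-0.{jobset_name}"


def master_addr_env(jobset_name: str, job_name: str) -> Dict[str, str]:
    """Return the container env entry in the shape used by the YAML snippet above."""
    return {"name": "MASTER_ADDR", "value": master_addr(jobset_name, job_name)}
```

With jobset_name="pytorch" and job_name="workers", master_addr returns "pytorch-workers-0-0.pytorch", matching the value in the snippet.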

xujyan avatar Mar 29 '24 00:03 xujyan

And you'd need to install the JobSet controller in the test env too, see README

xujyan avatar Mar 29 '24 00:03 xujyan