torchx
Added Kueue Job Scheduler
Added a Kubernetes Kueue batch job scheduler based on the Kubernetes Scheduler
Note: `local_kueue="local-kueue-name"` is required in the scheduler args for the queue-name label. For priority, add `kueue_priority_class="kueue-priority-class-name"` to the scheduler args.
Manual Testing
Set up a Kubernetes/OpenShift cluster
- Install Kueue
- Create the necessary Kueue resources: ClusterQueue, ResourceFlavor, LocalQueue, and workload priority classes for testing job priority
- Clone this branch locally and build a torchx whl file with `python3 setup.py bdist_wheel`
- Install the whl file with `pip install dist/torchx-0.7.0.dev0-py3-none-any.whl`
- Authenticate to your K8s cluster
- Run a sample job with `torchx run --scheduler kueue_job --scheduler_args namespace=default,local_kueue="default-kueue",image_repo="user/alpine" utils.echo --image alpine:latest --msg hello`; it should return something like `kueue_job://torchx_user/1234`
- Get the job's status with `torchx status kueue_job://torchx_user/1234`
- All jobs created with this scheduler should gain the `Suspended`/`JobResumed` status
- Users can add custom annotations in the scheduler args following this scheme: `"annotations": {"key":"value"}`
Integration test
- Start the minikube setup script with `bash setup_minikube_kueue.sh`
- Run the test file with `python scripts/kueue_test.py --container_repo localhost:5000/torchx`
- For a dry run, add the `--dryrun` flag: `python scripts/kueue_test.py --container_repo localhost:5000/torchx --dryrun`
Test plan:
Created unit tests based on the existing Kubernetes scheduler unit tests.
Hi @Bobbins228!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
What does kueue do that volcano doesn't? Is there a reason to support more than one batch scheduler on kubernetes?
Hi @ccharest93, we decided to add a Kueue batch job scheduler because Kueue is backed by the Kubernetes Batch SIG, and it gives users more options when choosing a Kubernetes scheduler.
Looks like there's also valid pyre/lint issues
@Bobbins228 there's been a bunch of cleanup on tests on the main branch -- if you rebase/merge, all tests should be passing (assuming no new breaking changes)
@Bobbins228 did you try multi-node ddp with it? I think you'd need a JobSet for that which sets up a headless k8s service for workers to reach the master. See example here https://github.com/kubernetes-sigs/jobset/blob/main/examples/pytorch/resnet-cifar10/resnet.yaml
Another way to approach this is perhaps not to couple tightly with a specific operator, but rather to output a standard k8s Job (or even Pod) spec with the container image and allow it to be piped into other tools that can transform it into whatever is suitable. But AFAICT something like `--output-format yaml` isn't supported in torchx, and `--dryrun` aims to provide human-readable rather than tool-parsable output, so I don't know if this is realistic?
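To illustrate the decoupled approach: the scheduler would emit a standard `batch/v1` Job manifest as plain data and leave transformation and submission to external tools. This is a sketch only; `job_manifest` is a hypothetical helper, not an existing torchx API:

```python
import json


def job_manifest(name: str, image: str, args: list[str]) -> dict:
    """Build a plain batch/v1 Job manifest as a dict.

    A decoupled scheduler could emit this and let external tools
    (kubectl, kustomize, etc.) transform and apply it.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "args": args},
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
    }


# Tool-parsable output that could be piped into `kubectl apply -f -`
print(json.dumps(job_manifest("echo-job", "alpine:latest", ["echo", "hello"])))
```

The point is only that a machine-readable spec, rather than human-readable dry-run text, is what makes piping into other tools possible.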
@xujyan @Bobbins228 looks like the ddp job on Kueue is having issues
I would expect that. Like I suggested above, you'd need to produce a JobSet instead of a Job, and as the example here shows, point to the master at

```yaml
- name: MASTER_ADDR
  value: "pytorch-workers-0-0.pytorch"
```

where the value is of the format `<jobset_name>-<job_name>-0-0.<jobset_name>`
And you'd need to install the JobSet controller in the test env too; see the README.
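The address format described above can be captured in a small helper. This is a hypothetical illustration of the naming convention, not code from this PR:

```python
def jobset_master_addr(jobset_name: str, job_name: str) -> str:
    """Build the master address for replica 0 of a JobSet job.

    Format: <jobset_name>-<job_name>-0-0.<jobset_name>
    i.e. pod "<jobset>-<job>-0-0" resolved via the JobSet's
    headless service (named after the JobSet).
    """
    return f"{jobset_name}-{job_name}-0-0.{jobset_name}"


# Matches the MASTER_ADDR value in the example above
print(jobset_master_addr("pytorch", "workers"))  # pytorch-workers-0-0.pytorch
```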