Add tolerations to KubernetesScheduler run opts

Open JackWittmayer opened this issue 6 months ago • 1 comments

Description

Similar to #1067, users should be able to specify which tolerations they would like their job pods to have.

Motivation/Background

This would let users run jobs on tainted nodes, for example for hardware validation testing. It also gives cluster operators more flexibility: they can taint nodes to keep other pods from being scheduled there while still allowing runs from TorchX.

Detailed Proposal

Add tolerations as a run opt to the KubernetesScheduler run_opts, KubernetesOpts, and other entry points, then apply the user-specified tolerations to the pod in the role_to_pod method.
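To make the proposal concrete, here is a rough sketch of how a tolerations run opt could be parsed and attached to a pod spec. Everything in it is an assumption for illustration: the helper names (parse_toleration, parse_tolerations, add_tolerations), the "key[=value][:effect]" string syntax (borrowed from the kubectl taint notation), and the use of plain dicts instead of the Kubernetes client models are all hypothetical and not part of TorchX today. Only the output field names (key, operator, value, effect) and the effect values follow the Kubernetes Toleration schema.

```python
from typing import Any, Dict, List, Optional

# Taint effects defined by Kubernetes.
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def parse_toleration(spec: str) -> Dict[str, str]:
    """Parse one hypothetical 'key[=value][:effect]' string into a toleration dict."""
    spec = spec.strip()
    body, has_effect, effect = spec.partition(":")
    key, has_value, value = body.partition("=")
    if not key:
        raise ValueError(f"toleration is missing a key: {spec!r}")

    toleration = {"key": key}
    if has_value:
        # 'key=value' matches taints with exactly this key and value.
        toleration["operator"] = "Equal"
        toleration["value"] = value
    else:
        # A bare 'key' matches any taint with this key.
        toleration["operator"] = "Exists"

    if has_effect:
        if effect not in VALID_EFFECTS:
            raise ValueError(f"unknown taint effect {effect!r} in {spec!r}")
        toleration["effect"] = effect
    return toleration


def parse_tolerations(opt: Optional[str]) -> List[Dict[str, str]]:
    """Parse a comma-separated run-opt value; None or '' means no tolerations."""
    if not opt:
        return []
    return [parse_toleration(part) for part in opt.split(",") if part.strip()]


def add_tolerations(
    pod_spec: Dict[str, Any], tolerations: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Return a copy of pod_spec with the extra tolerations appended.

    With an empty list the spec is returned unchanged, so scheduling behavior
    only differs when the user explicitly asks for tolerations.
    """
    merged = dict(pod_spec)
    if tolerations:
        merged["tolerations"] = list(pod_spec.get("tolerations", [])) + tolerations
    return merged


if __name__ == "__main__":
    extra = parse_tolerations("nvidia.com/gpu=true:NoSchedule,dedicated")
    print(add_tolerations({"containers": []}, extra))
```

In an actual change, the parsed entries would presumably be converted to the Kubernetes client's toleration objects inside role_to_pod; the dict form above is only meant to show the shape of the data and the opt-in behavior.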

Alternatives

I can't think of any alternatives. As far as I know, there is no built-in support for custom tolerations currently. Tolerations significantly change the pod scheduling behavior, so they should only be added when the user requests them.

Additional context/links

Relevant code linked above. Documentation: https://docs.pytorch.org/torchx/main/schedulers/kubernetes.html

JackWittmayer, May 19 '25 16:05

Same here. Do you have any objections, @kiukchung, @d4l3k, @tonykao8080, @andywag?

clumsy, May 21 '25 14:05

Just want to check whether there are any concerns; otherwise I can contribute this. @kiukchung, @d4l3k, @tonykao8080, @andywag

clumsy, Oct 15 '25 18:10

@clumsy this sounds fine to me as well

d4l3k, Oct 15 '25 23:10