Consider configuring kueue waitForPodsReady

Open avrittrohwer opened this issue 5 months ago • 3 comments

kueue supports all-or-nothing scheduling: https://kueue.sigs.k8s.io/docs/tasks/manage/setup_wait_for_pods_ready/

Large multi-pod workloads that need every pod running to make progress (e.g. single-program-multiple-data workloads) can deadlock capacity if the physical availability of resources does not match the configured kueue quotas. The kueue waitForPodsReady feature configures kueue to additionally monitor the pod readiness condition of each workload. If not all pods become ready within a configured timeout, the workload is evicted and requeued.
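
For illustration, a minimal sketch of how this might look in the Kueue controller manager configuration, following the documentation linked above. The values shown (`10m`, `5`) are placeholder assumptions rather than tuned recommendations for xpk, and the fields available under `requeuingStrategy` depend on the installed Kueue version.

```yaml
# Fragment of the Kueue manager Configuration (typically supplied through the
# kueue-manager-config ConfigMap). All values are illustrative assumptions.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true            # turn on all-or-nothing readiness tracking
  timeout: 10m            # evict the workload if not all pods are ready within this window
  blockAdmission: true    # hold back further admissions until admitted workloads are ready
  requeuingStrategy:
    timestamp: Eviction   # order the requeued workload by its eviction time
    backoffLimitCount: 5  # stop requeuing (deactivate the workload) after this many attempts
```

With `blockAdmission: true`, workloads are admitted one at a time until their pods are ready. That avoids two large jobs each holding part of the capacity, at the cost of slower admission when many small workloads are queued.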

avrittrohwer · Sep 23 '24 17:09