StatefulSet workload is marked finished when all pods are deleted
What happened:
When I delete all pods of a StatefulSet and the pods terminate successfully, the workload is marked finished.
This is inconsistent: if the Pods fail, we don't mark the workload as finished.
It is also inconsistent with Jobs, where a deleted Pod is simply recreated and the workload continues to run.
What you expected to happen:
The workload should not be marked as finished; the StatefulSet controller recreates the deleted Pods and the workload keeps running, consistent with the behavior for Jobs.
How to reproduce it (as minimally and precisely as possible):
- create the STS:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-statefulset
  labels:
    app: nginx
    kueue.x-k8s.io/queue-name: user-queue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: registry.k8s.io/nginx-slim:0.26
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
  serviceName: "nginx"
- Track the workload status
- Delete all pods:
kubectl delete --all pods

Issue: the workload got finished:
> kubectl get workloads -w --output-watch-events
EVENT NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
ADDED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True 12m
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True True 12m
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True True 12m
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True True 12m
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True True 12m
DELETED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True True 12m
ADDED statefulset-nginx-statefulset-ed050 user-queue 0s
MODIFIED statefulset-nginx-statefulset-ed050 user-queue 0s
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True 0s
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True 1s
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True 2s
MODIFIED statefulset-nginx-statefulset-ed050 user-queue cluster-queue True 2s
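For reference, the Finished condition can also be checked directly instead of watching all columns (a sketch; the workload name is taken from the run above):

kubectl get workload statefulset-nginx-statefulset-ed050 \
  -o jsonpath='{.status.conditions[?(@.type=="Finished")].status}'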
The same is true for LWS:
> kubectl get workloads -w --output-watch-events
EVENT NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
ADDED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True 19s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True True 33s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True True 33s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True True 33s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True True 33s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True True 33s
DELETED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True True 33s
ADDED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue 1s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue 1s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True 1s
MODIFIED leaderworkerset-leaderworkerset-multi-template-0-57ed8 user-queue cluster-queue True 1s
Anything else we need to know?:
The fact that the workload is re-created is also a problem (arguably a more important one), but hopefully we can decouple the fixes to make them easier to track; see my comment here: https://github.com/kubernetes-sigs/kueue/pull/4799/files#r2015748155.
I believe this is the proper fix, but we need an e2e test case for this: https://github.com/kubernetes-sigs/kueue/pull/4799/files#diff-dfb49586a8522fa91d733051fd3b7e4b3ff174907898cf249ab16e2620976a5dR338-R340
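Not the author's test, just a rough sketch of what such an e2e case could assert, assuming the Ginkgo/Gomega style of the existing e2e suite; k8sClient, ctx, ns, and workloadKey are hypothetical helpers/variables, and the timeouts are placeholders.

// Assumed imports: ginkgo/v2, gomega, corev1 "k8s.io/api/core/v1",
// apimeta "k8s.io/apimachinery/pkg/api/meta",
// "sigs.k8s.io/controller-runtime/pkg/client",
// kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1", "time".
ginkgo.It("should not finish the Workload when all StatefulSet pods are deleted", func() {
	// Delete every Pod of the admitted StatefulSet; the StatefulSet
	// controller is expected to recreate them.
	gomega.Expect(k8sClient.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(ns.Name),
		client.MatchingLabels{"app": "nginx"},
	)).To(gomega.Succeed())

	// The Workload must not flip to Finished while the Pods come back.
	gomega.Consistently(func(g gomega.Gomega) {
		wl := &kueue.Workload{}
		g.Expect(k8sClient.Get(ctx, workloadKey, wl)).To(gomega.Succeed())
		g.Expect(apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadFinished)).
			To(gomega.BeFalse(), "workload should not be marked Finished")
	}, 30*time.Second, time.Second).Should(gomega.Succeed())
})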
/assign @mbobrovskyi
as he is already working on the closely related https://github.com/kubernetes-sigs/kueue/issues/4342
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@mbobrovskyi any progress on this, or is the issue perhaps already fixed?
/remove-lifecycle stale
We’ve already fixed this for LWS: https://github.com/kubernetes-sigs/kueue/pull/4790. The StatefulSet PR is still under review: https://github.com/kubernetes-sigs/kueue/issues/4805.
/retitle StatefulSet workload is marked finished when all pods are deleted
Scoping to STS since the LWS is solved per https://github.com/kubernetes-sigs/kueue/issues/4805#issuecomment-3004195651
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale