Pods are stuck in the Pending phase when they should all start at the same time
Hello, our coscheduling config looks like this:
Name: scheduler-config
Namespace: scheduler-plugins
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: gang-scheduler
meta.helm.sh/release-namespace: default
Data
====
scheduler-config.yaml:
----
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: scheduler-plugins-scheduler
  plugins:
    queueSort:
      enabled:
      - name: Coscheduling
      disabled:
      - name: "*"
    preFilter:
      enabled:
      - name: Coscheduling
    permit:
      enabled:
      - name: Coscheduling
    reserve:
      enabled:
      - name: Coscheduling
    postBind:
      enabled:
      - name: Coscheduling
  pluginConfig:
  - name: Coscheduling
    args:
      permitWaitingTimeSeconds: 10
      deniedPGExpirationTimeSeconds: 3
Events: <none>
When we run a PodGroup of 6 pods, only 1 starts right away and the rest start a long time after the first one (it takes about 10 minutes on average for each pod to start).
This is the error message we see when we describe one of the pods:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <invalid> scheduler-plugins-scheduler 0/5 nodes are available: 5 pre-filter pod mission-dispatcher cannot find enough sibling pods, current pods number: 4, minMember of group: 6.
Warning FailedScheduling <invalid> scheduler-plugins-scheduler 0/5 nodes are available: 5 pre-filter pod mission-dispatcher cannot find enough sibling pods, current pods number: 4, minMember of group: 6.
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler 0/5 nodes are available: 5 pod with pgName: tegra-silp7bmpstrsfivglz67nliuw/echo last failed in 3s, deny.
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Is something messed up in our config?
Can you also provide the PodGroup YAML here?
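For reference, a minimal PodGroup for a gang of 6 would look roughly like the sketch below; the group name is illustrative, and the 6 pods would need to carry the matching pod-group.scheduling.sigs.k8s.io label:

```yaml
# Illustrative only; not the reporter's actual manifest.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: mission-dispatcher-pg   # assumed name for illustration
spec:
  minMember: 6                  # all 6 pods must be schedulable together
  scheduleTimeoutSeconds: 10
```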
I see that Failed in the Status is 1. Is there one pod in a failed state? Can you get the pods with kubectl?
Status:
  Failed:               1
  Phase:                Scheduling
  Running:              1
  Schedule Start Time:  2022-03-19T00:10:54Z
  Scheduled:            1
  Succeeded:            0
Can you describe the details of the process? For example:
- When do you create the PodGroup: before or after the object is created?
- Does it schedule successfully at first? I am curious why one pod failed; if the pods can't be scheduled, all of them should be Pending.
Is this gang scheduling, i.e. you want all 6 pods to be scheduled together? @karanchahal
This condition may happen on the first scheduling attempt, like https://github.com/kubernetes-sigs/scheduler-plugins/issues/364, for the same reason.
It can also happen when all 6 pods were scheduled but some of them had to be re-created for some reason (HA, etc.). Those re-created pods then need to be re-scheduled, and there is a good chance that the number of pods is less than 6 in PreFilter, so those pods get rejected.
A simple way that might avoid this situation is to change PodGroupManager.PreFilter like this:
func (pgMgr *PodGroupManager) PreFilter(ctx context.Context, pod *corev1.Pod) error {
    ...
    // Let re-created or newly created pods pass quickly: only enforce the sibling
    // check for pods that were created more than 3s ago.
    if time.Now().After(pod.CreationTimestamp.Time.Add(3 * time.Second)) {
        pods, err := pgMgr.podLister.Pods(pod.Namespace).List(
            labels.SelectorFromSet(labels.Set{khaos.PodGroupLabelKey: GetPodGroupLabel(pod)}),
        )
        if err != nil {
            return fmt.Errorf("podLister list pods failed: %v", err)
        }
        if len(pods) < int(pg.Spec.MinMember) {
            return fmt.Errorf("pre-filter pod %v cannot find enough sibling pods, "+
                "current pods number: %v, minMember of group: %v", pod.Name, len(pods), pg.Spec.MinMember)
        }
    }
    if pg.Spec.MinResources == nil {
        return nil
    }
    ...
}
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Hello, is this issue fixed? On an empty cluster we are seeing about 70 seconds to schedule two pods with the co-scheduler.
Yes, we do; the bug has different symptoms on different versions. Starting with v0.22.6 it is expected to be basically gone, and in the upcoming v0.23.X the cold start time is improved even further. Which version are you using?
@Huang-Wei we are using k8s.gcr.io/scheduler-plugins/controller:v0.22.6 but we still see the issue.
It seems this case is described in #408, which is common when the system is pretty idle (meaning no pods are queuing):
> if the scheduler acts very fast to schedule pod 1 and the other pods in the same group have not been created yet, we would fail PreFilter due to not reaching the required min number and would add the PodGroup to the lastDeniedPG in PostFilter. Due to the 3s timeout cache used for lastDeniedPG, if the other pods in the PG were created within those 3s, they would all be rejected during PreFilter due to lastDeniedPG.
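To make that 3s window concrete, here is a minimal, self-contained sketch of the denial behavior described above (the cache type and names are illustrative, not the plugin's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// deniedCache is a stand-in for the lastDeniedPG cache mentioned above; it only
// illustrates the 3s window, the real plugin code is structured differently.
type deniedCache struct {
	ttl    time.Duration
	denied map[string]time.Time // key: "<namespace>/<podGroupName>"
}

func (c *deniedCache) deny(pg string) { c.denied[pg] = time.Now() }

func (c *deniedCache) isDenied(pg string) bool {
	t, ok := c.denied[pg]
	return ok && time.Since(t) < c.ttl
}

func main() {
	cache := &deniedCache{ttl: 3 * time.Second, denied: map[string]time.Time{}}

	// Pod 1 of the group is scheduled before its siblings exist, so PreFilter
	// fails and the whole PodGroup lands in the denied cache.
	cache.deny("default/test-pg")

	// A sibling created ~1s later is rejected purely because of the cached denial;
	// this corresponds to the "last failed in 3s, deny" events shown in this thread.
	time.Sleep(1 * time.Second)
	fmt.Println("sibling rejected by cache:", cache.isDenied("default/test-pg")) // true

	// Once the 3s window expires, the group can be scheduled again.
	time.Sleep(3 * time.Second)
	fmt.Println("still denied after window:", cache.isDenied("default/test-pg")) // false
}
```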
We have a fix, but it is not available in the 0.22 release. I will cut a new 0.23.10 release when k8s v1.23.10 is out (which should be in a couple of days): https://kubernetes.io/releases/patch-releases/#1-23.
@Huang-Wei Thanks, I am willing to try it out.
/remove-lifecycle rotten
Hello @Huang-Wei, curious to know if there is a new release out that we can try.
@asm582 I'm doing some verification and will promote it as an official release later today; it works fine locally:
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=3
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-kdmtw 1/1 Running 0 2s
pause-64f58bcb89-qmqlz 1/1 Running 0 2s
pause-64f58bcb89-s4wq5 1/1 Running 0 2s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=0
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=2
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-gb44g 0/1 Pending 0 2s
pause-64f58bcb89-ppck5 0/1 Pending 0 2s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=3
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-gb44g 1/1 Running 0 10s
pause-64f58bcb89-m65f8 1/1 Running 0 2s
pause-64f58bcb89-ppck5 1/1 Running 0 10s
Thanks @Huang-Wei. It looks like you created the PodGroup before submitting the Deployment in your testing. If possible, can you test by submitting new Deployments that have new PodGroups, and can we do testing in general for replicas > 10?
@asm582 v0.23.10 is live; you can refer to https://github.com/kubernetes-sigs/scheduler-plugins/pull/414 for the latest setup.
> It looks like you created the PodGroup before submitting the Deployment

That doesn't quite matter: if a PodGroup is not yet acked by the scheduler while the pods carry PG labels, the pods will simply be marked as unschedulable.
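A rough, self-contained sketch of that ordering behavior (illustrative only; the lookup type and function names are placeholders, not the plugin's actual API):

```go
package main

import "fmt"

// pgLookup stands in for the scheduler's PodGroup cache; it only answers
// "has the scheduler seen this PodGroup yet?".
type pgLookup func(namespace, name string) bool

// preFilterSketch mirrors the behavior described above: a pod carrying the
// PodGroup label stays unschedulable until the PodGroup object is known to the
// scheduler, so creating the PodGroup before or after the Deployment mostly
// changes how long the pods wait in Pending.
func preFilterSketch(namespace string, podLabels map[string]string, known pgLookup) error {
	pgName := podLabels["pod-group.scheduling.sigs.k8s.io"]
	if pgName == "" {
		return nil // no PG label: treated as an ordinary pod
	}
	if !known(namespace, pgName) {
		return fmt.Errorf("pod group %s/%s not acked by the scheduler yet; pod stays unschedulable", namespace, pgName)
	}
	return nil
}

func main() {
	known := func(ns, name string) bool { return ns == "default" && name == "pg1" }

	// PodGroup pg1 exists: the pod passes this check.
	fmt.Println(preFilterSketch("default", map[string]string{"pod-group.scheduling.sigs.k8s.io": "pg1"}, known))

	// PodGroup pg2 has not been created yet: the pod is held as unschedulable.
	fmt.Println(preFilterSketch("default", map[string]string{"pod-group.scheduling.sigs.k8s.io": "pg2"}, known))
}
```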
@asm582 I did give it a try; let me know if this is what you meant:
weih@m1max:~/manifests/PodGroup/release-verify|⇒ cat pg-and-deploy.yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pg1
spec:
  scheduleTimeoutSeconds: 10
  minMember: 10
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
spec:
  replicas: 10
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
        pod-group.scheduling.sigs.k8s.io: pg1
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.6
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k apply -f pg-and-deploy.yaml
podgroup.scheduling.sigs.k8s.io/pg1 created
deployment.apps/pause created
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-4n8nh 0/1 Pending 0 2s
pause-64f58bcb89-5h4xg 0/1 ContainerCreating 0 2s
pause-64f58bcb89-5hdf2 0/1 Pending 0 2s
pause-64f58bcb89-7jqsg 0/1 Pending 0 2s
pause-64f58bcb89-bfs2b 0/1 ContainerCreating 0 2s
pause-64f58bcb89-cgtb4 0/1 ContainerCreating 0 2s
pause-64f58bcb89-qdmx4 0/1 ContainerCreating 0 2s
pause-64f58bcb89-qww6r 0/1 Pending 0 2s
pause-64f58bcb89-vmtdf 0/1 ContainerCreating 0 2s
pause-64f58bcb89-zrsbk 0/1 Pending 0 2s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-4n8nh 1/1 Running 0 7s
pause-64f58bcb89-5h4xg 1/1 Running 0 7s
pause-64f58bcb89-5hdf2 1/1 Running 0 7s
pause-64f58bcb89-7jqsg 1/1 Running 0 7s
pause-64f58bcb89-bfs2b 0/1 ContainerCreating 0 7s
pause-64f58bcb89-cgtb4 0/1 ContainerCreating 0 7s
pause-64f58bcb89-qdmx4 1/1 Running 0 7s
pause-64f58bcb89-qww6r 1/1 Running 0 7s
pause-64f58bcb89-vmtdf 0/1 ContainerCreating 0 7s
pause-64f58bcb89-zrsbk 1/1 Running 0 7s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-4n8nh 1/1 Running 0 9s
pause-64f58bcb89-5h4xg 1/1 Running 0 9s
pause-64f58bcb89-5hdf2 1/1 Running 0 9s
pause-64f58bcb89-7jqsg 1/1 Running 0 9s
pause-64f58bcb89-bfs2b 1/1 Running 0 9s
pause-64f58bcb89-cgtb4 1/1 Running 0 9s
pause-64f58bcb89-qdmx4 1/1 Running 0 9s
pause-64f58bcb89-qww6r 1/1 Running 0 9s
pause-64f58bcb89-vmtdf 1/1 Running 0 9s
pause-64f58bcb89-zrsbk 1/1 Running 0 9s
@Huang-Wei In version 0.22.6 we get a lot of FailedScheduling events; I hope they are resolved in your latest release and that you don't see them in your tests:
Warning FailedScheduling 96s scheduler-plugins-scheduler 0/10 nodes are available: 10 pod with pgName: default/test-pg last failed in 3s, deny.
Warning FailedScheduling 6s scheduler-plugins-scheduler 0/10 nodes are available: 10 pod with pgName: default/test-pg last failed in 3s, deny.
@asm582 That is not surprising to me, and it is exactly what 0.23.10 resolved.