Pods are stuck in the Pending phase when they should all start at the same time
Hello, our coscheduling config looks like this:
Name: scheduler-config
Namespace: scheduler-plugins
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: gang-scheduler
meta.helm.sh/release-namespace: default
Data
====
scheduler-config.yaml:
----
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: scheduler-plugins-scheduler
  plugins:
    queueSort:
      enabled:
      - name: Coscheduling
      disabled:
      - name: "*"
    preFilter:
      enabled:
      - name: Coscheduling
    permit:
      enabled:
      - name: Coscheduling
    reserve:
      enabled:
      - name: Coscheduling
    postBind:
      enabled:
      - name: Coscheduling
  pluginConfig:
  - name: Coscheduling
    args:
      permitWaitingTimeSeconds: 10
      deniedPGExpirationTimeSeconds: 3
Events: <none>
When we run a PodGroup of 6 pods, only 1 starts right away and the rest start a long time after the first one (it takes about 10 minutes on average for each pod to start).
This is the error message we see when we describe one of the pods:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <invalid> scheduler-plugins-scheduler 0/5 nodes are available: 5 pre-filter pod mission-dispatcher cannot find enough sibling pods, current pods number: 4, minMember of group: 6.
Warning FailedScheduling <invalid> scheduler-plugins-scheduler 0/5 nodes are available: 5 pre-filter pod mission-dispatcher cannot find enough sibling pods, current pods number: 4, minMember of group: 6.
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler 0/5 nodes are available: 5 pod with pgName: tegra-silp7bmpstrsfivglz67nliuw/echo last failed in 3s, deny.
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Warning FailedScheduling <invalid> scheduler-plugins-scheduler pod "mission-dispatcher" rejected while waiting on permit: rejected due to timeout after waiting 30s at plugin Coscheduling
Is something messed up in our config?
Can you also provide the PodGroup YAML here?
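For reference, a minimal PodGroup for a gang of 6 would look roughly like the sketch below; the group name is illustrative, and the 6 pods would need to carry the matching pod-group.scheduling.sigs.k8s.io label:

```yaml
# Illustrative only; not the reporter's actual manifest.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: mission-dispatcher-pg   # assumed name for illustration
spec:
  minMember: 6                  # all 6 pods must be schedulable together
  scheduleTimeoutSeconds: 10
```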
I see that Failed in the Status is 1. Is there one pod in a failed state? Can you get the pods with kubectl?
Status:
  Failed:               1
  Phase:                Scheduling
  Running:              1
  Schedule Start Time:  2022-03-19T00:10:54Z
  Scheduled:            1
  Succeeded:            0
Can you describe the details of the process? For example:
- When do you create the PodGroup: before or after the object is created?
- Does it schedule successfully at first? I am curious why one pod failed; if the pods can't be scheduled, all of them should be Pending.
Is this gang scheduling, i.e. you want all 6 pods to be scheduled together? @karanchahal
This condition may happen on the first scheduling attempt, like https://github.com/kubernetes-sigs/scheduler-plugins/issues/364, for the same reason.
It can also happen when all 6 pods were scheduled but some of them had to be re-created for some reason (HA, etc.). Those re-created pods then need to be re-scheduled, and there is a good chance that the number of pods is less than 6 in PreFilter, so those pods get rejected.
A simple way that might avoid this situation is to change PodGroupManager.PreFilter like this:
func (pgMgr *PodGroupManager) PreFilter(ctx context.Context, pod *corev1.Pod) error {
    ...
    // Let re-created or newly created pods pass quickly: only enforce the sibling
    // check for pods that were created more than 3s ago.
    if time.Now().After(pod.CreationTimestamp.Time.Add(3 * time.Second)) {
        pods, err := pgMgr.podLister.Pods(pod.Namespace).List(
            labels.SelectorFromSet(labels.Set{khaos.PodGroupLabelKey: GetPodGroupLabel(pod)}),
        )
        if err != nil {
            return fmt.Errorf("podLister list pods failed: %v", err)
        }
        if len(pods) < int(pg.Spec.MinMember) {
            return fmt.Errorf("pre-filter pod %v cannot find enough sibling pods, "+
                "current pods number: %v, minMember of group: %v", pod.Name, len(pods), pg.Spec.MinMember)
        }
    }
    if pg.Spec.MinResources == nil {
        return nil
    }
    ...
}
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Hello, is this issue fixed? On an empty cluster we are seeing about 70 seconds to schedule two pods with the co-scheduler.
Yes, we do; the bug has different symptoms on different versions. Starting with v0.22.6 it is expected to be basically gone, and in the upcoming v0.23.X the cold start time is improved even further. Which version are you using?
@Huang-Wei we are using k8s.gcr.io/scheduler-plugins/controller:v0.22.6 but we still see the issue.
It seems this case is described in #408, which is common when the system is pretty idle (meaning no pods are queuing):
> if the scheduler acts very fast to schedule pod 1 and the other pods in the same group have not been created yet, we would fail PreFilter due to not reaching the required min number and would add the PodGroup to the lastDeniedPG in PostFilter. Due to the 3s timeout cache used for lastDeniedPG, if the other pods in the PG were created within those 3s, they would all be rejected during PreFilter due to lastDeniedPG.
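To make that 3s window concrete, here is a minimal, self-contained sketch of the denial behavior described above (the cache type and names are illustrative, not the plugin's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// deniedCache is a stand-in for the lastDeniedPG cache mentioned above; it only
// illustrates the 3s window, the real plugin code is structured differently.
type deniedCache struct {
	ttl    time.Duration
	denied map[string]time.Time // key: "<namespace>/<podGroupName>"
}

func (c *deniedCache) deny(pg string) { c.denied[pg] = time.Now() }

func (c *deniedCache) isDenied(pg string) bool {
	t, ok := c.denied[pg]
	return ok && time.Since(t) < c.ttl
}

func main() {
	cache := &deniedCache{ttl: 3 * time.Second, denied: map[string]time.Time{}}

	// Pod 1 of the group is scheduled before its siblings exist, so PreFilter
	// fails and the whole PodGroup lands in the denied cache.
	cache.deny("default/test-pg")

	// A sibling created ~1s later is rejected purely because of the cached denial;
	// this corresponds to the "last failed in 3s, deny" events shown in this thread.
	time.Sleep(1 * time.Second)
	fmt.Println("sibling rejected by cache:", cache.isDenied("default/test-pg")) // true

	// Once the 3s window expires, the group can be scheduled again.
	time.Sleep(3 * time.Second)
	fmt.Println("still denied after window:", cache.isDenied("default/test-pg")) // false
}
```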
We have a fix, but it is not available in the 0.22 release. I will cut a new 0.23.10 release when k8s v1.23.10 is out (which should be in a couple of days): https://kubernetes.io/releases/patch-releases/#1-23.
@Huang-Wei Thanks, I am willing to try it out.
/remove-lifecycle rotten
Hello @Huang-Wei, curious to know if there is a new release out that we can try.
@asm582 I'm doing some verification and will promote it as an official release later today; it works fine locally:
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=3
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-kdmtw 1/1 Running 0 2s
pause-64f58bcb89-qmqlz 1/1 Running 0 2s
pause-64f58bcb89-s4wq5 1/1 Running 0 2s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=0
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=2
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-gb44g 0/1 Pending 0 2s
pause-64f58bcb89-ppck5 0/1 Pending 0 2s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k scale deploy pause --replicas=3
deployment.apps/pause scaled
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-gb44g 1/1 Running 0 10s
pause-64f58bcb89-m65f8 1/1 Running 0 2s
pause-64f58bcb89-ppck5 1/1 Running 0 10s
Thanks @Huang-Wei. It looks like you created the PodGroup before submitting the Deployment in your testing. If possible, can you test by submitting new Deployments that have new PodGroups, and can we do testing in general for replicas > 10?
@asm582 v0.23.10 is live; you can refer to https://github.com/kubernetes-sigs/scheduler-plugins/pull/414 for the latest setup.
> It looks like you created the PodGroup before submitting the Deployment

That doesn't quite matter: if a PodGroup is not yet acked by the scheduler while the pods carry PG labels, the pods will simply be marked as unschedulable.
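A rough, self-contained sketch of that ordering behavior (illustrative only; the lookup type and function names are placeholders, not the plugin's actual API):

```go
package main

import "fmt"

// pgLookup stands in for the scheduler's PodGroup cache; it only answers
// "has the scheduler seen this PodGroup yet?".
type pgLookup func(namespace, name string) bool

// preFilterSketch mirrors the behavior described above: a pod carrying the
// PodGroup label stays unschedulable until the PodGroup object is known to the
// scheduler, so creating the PodGroup before or after the Deployment mostly
// changes how long the pods wait in Pending.
func preFilterSketch(namespace string, podLabels map[string]string, known pgLookup) error {
	pgName := podLabels["pod-group.scheduling.sigs.k8s.io"]
	if pgName == "" {
		return nil // no PG label: treated as an ordinary pod
	}
	if !known(namespace, pgName) {
		return fmt.Errorf("pod group %s/%s not acked by the scheduler yet; pod stays unschedulable", namespace, pgName)
	}
	return nil
}

func main() {
	known := func(ns, name string) bool { return ns == "default" && name == "pg1" }

	// PodGroup pg1 exists: the pod passes this check.
	fmt.Println(preFilterSketch("default", map[string]string{"pod-group.scheduling.sigs.k8s.io": "pg1"}, known))

	// PodGroup pg2 has not been created yet: the pod is held as unschedulable.
	fmt.Println(preFilterSketch("default", map[string]string{"pod-group.scheduling.sigs.k8s.io": "pg2"}, known))
}
```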
@asm582 I did give it a try; let me know if this is what you meant:
weih@m1max:~/manifests/PodGroup/release-verify|⇒ cat pg-and-deploy.yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pg1
spec:
  scheduleTimeoutSeconds: 10
  minMember: 10
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
spec:
  replicas: 10
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
        pod-group.scheduling.sigs.k8s.io: pg1
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.6
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k apply -f pg-and-deploy.yaml
podgroup.scheduling.sigs.k8s.io/pg1 created
deployment.apps/pause created
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-4n8nh 0/1 Pending 0 2s
pause-64f58bcb89-5h4xg 0/1 ContainerCreating 0 2s
pause-64f58bcb89-5hdf2 0/1 Pending 0 2s
pause-64f58bcb89-7jqsg 0/1 Pending 0 2s
pause-64f58bcb89-bfs2b 0/1 ContainerCreating 0 2s
pause-64f58bcb89-cgtb4 0/1 ContainerCreating 0 2s
pause-64f58bcb89-qdmx4 0/1 ContainerCreating 0 2s
pause-64f58bcb89-qww6r 0/1 Pending 0 2s
pause-64f58bcb89-vmtdf 0/1 ContainerCreating 0 2s
pause-64f58bcb89-zrsbk 0/1 Pending 0 2s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-4n8nh 1/1 Running 0 7s
pause-64f58bcb89-5h4xg 1/1 Running 0 7s
pause-64f58bcb89-5hdf2 1/1 Running 0 7s
pause-64f58bcb89-7jqsg 1/1 Running 0 7s
pause-64f58bcb89-bfs2b 0/1 ContainerCreating 0 7s
pause-64f58bcb89-cgtb4 0/1 ContainerCreating 0 7s
pause-64f58bcb89-qdmx4 1/1 Running 0 7s
pause-64f58bcb89-qww6r 1/1 Running 0 7s
pause-64f58bcb89-vmtdf 0/1 ContainerCreating 0 7s
pause-64f58bcb89-zrsbk 1/1 Running 0 7s
weih@m1max:~/manifests/PodGroup/release-verify|⇒ k get po
NAME READY STATUS RESTARTS AGE
pause-64f58bcb89-4n8nh 1/1 Running 0 9s
pause-64f58bcb89-5h4xg 1/1 Running 0 9s
pause-64f58bcb89-5hdf2 1/1 Running 0 9s
pause-64f58bcb89-7jqsg 1/1 Running 0 9s
pause-64f58bcb89-bfs2b 1/1 Running 0 9s
pause-64f58bcb89-cgtb4 1/1 Running 0 9s
pause-64f58bcb89-qdmx4 1/1 Running 0 9s
pause-64f58bcb89-qww6r 1/1 Running 0 9s
pause-64f58bcb89-vmtdf 1/1 Running 0 9s
pause-64f58bcb89-zrsbk 1/1 Running 0 9s
@Huang-Wei In version 0.22.6 we get a lot of FailedScheduling events; I hope they are resolved in your latest release and that you don't see them in your tests:
Warning FailedScheduling 96s scheduler-plugins-scheduler 0/10 nodes are available: 10 pod with pgName: default/test-pg last failed in 3s, deny.
Warning FailedScheduling 6s scheduler-plugins-scheduler 0/10 nodes are available: 10 pod with pgName: default/test-pg last failed in 3s, deny.
@asm582 That is not surprising to me, and it is exactly what 0.23.10 resolved.