
GameServer stuck on state Scheduled when Pod failed with reason OutOfpods

Open katsew opened this issue 2 years ago • 54 comments

What happened:

Agones didn't create a new Pod when a Pod failed with reason OutOfpods, and the GameServer got stuck in the Scheduled state.

What you expected to happen:

The GameServer is expected to create a new Pod if its backing Pod fails with reason OutOfpods.

How to reproduce it (as minimally and precisely as possible):

  1. Put the following manifest at /etc/kubernetes/manifests/static-pod.manifest on the node under test.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: kube-system
  labels:
    component: nginx
    tier: node
spec:
  hostNetwork: true
  containers:
  - name: nginx
    image: nginx:1.14.2
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
  priorityClassName: system-node-critical
  priority: 2000001000
  tolerations:
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    operator: Exists
  2. Set the Fleet replicas to the Pod capacity of the node.
  3. Confirm that some of the GameServer Pods are stuck in state Pending.
  4. Forcibly delete the static Pod created in step 1 from kube-system.
    • kubectl delete pod --force --grace-period=0 <static-pod-name> -n kube-system

All the GameServer Pods that were stuck in the Pending state then fail with reason OutOfpods.

Anything else we need to know?:

Here is the Pod status from my reproduction.

status:
  message: 'Pod Node didn''t have enough resource: pods, requested: 1, used: 32, capacity:
    32'
  phase: Failed
  reason: OutOfpods

I created the Fleet from the official documentation.

Environment:

  • Agones version: 1.20.0
  • Kubernetes version (use kubectl version): Client Version: v1.21.0 Server Version: v1.21.12-gke.1500
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): yaml
  • Troubleshooting guide log(s):
  • Others:

katsew avatar Jul 26 '22 11:07 katsew

The GameServer is expected to create a new Pod if its backing Pod fails with reason OutOfpods.

Sorry, but maybe I'm missing something - but how is Agones supposed to create a new Pod if there isn't room in the cluster?

Or do you mean that Agones doesn't recover when there should be room?

markmandel avatar Jul 26 '22 16:07 markmandel

I'm also concerned that if you delete all the pods in the kube-system namespace, you are also breaking Kubernetes.

markmandel avatar Jul 26 '22 16:07 markmandel

The GameServer is expected to create a new Pod if its backing Pod fails with reason OutOfpods.

Sorry, but maybe I'm missing something - but how is Agones supposed to create a new Pod if there isn't room in the cluster?

Or do you mean that Agones doesn't recover when there should be room?

I mean that Agones doesn't recover when there should be room.

katsew avatar Jul 27 '22 01:07 katsew

I'm also concerned that if you delete all the pods in the kube-system namespace, you are also breaking Kubernetes.

Usually, if a Node has reached its Pod capacity and there are no other schedulable Nodes, the Pod stays stuck in Pending. Since failure with OutOfpods is rare, it was likely necessary to put the cluster in a broken state in order to reproduce it. As far as I could tell, simply deleting the GameServer did not reproduce it.

katsew avatar Jul 27 '22 01:07 katsew

I have updated the reproduction procedure to be more accurate. By forcibly deleting pods, it can be reproduced in one attempt.

katsew avatar Jul 27 '22 03:07 katsew

I'm curious - what does kubectl delete pod --force --grace-period=0 --all -n kube-system force to happen?

Is that a required step to replicate the issue?

markmandel avatar Jul 27 '22 15:07 markmandel

Also, can you share a kubectl describe of the Pod that has failed please as well?

It sounds like we should move the GameServer to Unhealthy if the backing pod moves to an OutOfpods state, but I'm just trying to nail down exactly what is happening here.

markmandel avatar Jul 27 '22 15:07 markmandel

I'm curious - what does kubectl delete pod --force --grace-period=0 --all -n kube-system force to happen?

Is that a required step to replicate the issue?

I tried forcibly deleting only the GameServer Pods, but still could not reproduce the problem. So I checked which kube-system component is actually causing this OutOfpods failure; it turned out to be kube-proxy.

I have updated the confirmed reproduction method.

katsew avatar Jul 28 '22 03:07 katsew

Also, can you share a kubectl describe of the Pod that has failed please as well?

Sure, here it is.

Name:           simple-game-server-qxtcq-wbs6p
Namespace:      default
Priority:       0
Node:           gke-friday-developme-gameserver-pool--f788a1c2-szxx/
Start Time:     Thu, 28 Jul 2022 03:15:22 +0000
Labels:         agones.dev/gameserver=simple-game-server-qxtcq-wbs6p
                agones.dev/role=gameserver
Annotations:    agones.dev/container: simple-game-server
                agones.dev/sdk-version: 1.20.0
                cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:         Failed
Reason:         OutOfpods
Message:        Pod Node didn't have enough resource: pods, requested: 1, used: 32, capacity: 32
IP:
IPs:            <none>
Controlled By:  GameServer/simple-game-server-qxtcq-wbs6p
Containers:
  agones-gameserver-sidecar:
    Image:      gcr.io/agones-images/agones-sdk:1.20.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --grpc-port=9357
      --http-port=9358
    Requests:
      cpu:     30m
    Liveness:  http-get http://:8080/healthz delay=3s timeout=1s period=3s #success=1 #failure=3
    Environment:
      GAMESERVER_NAME:  simple-game-server-qxtcq-wbs6p
      POD_NAMESPACE:    default (v1:metadata.namespace)
      FEATURE_GATES:    CustomFasSyncInterval=false&Example=true&NodeExternalDNS=true&PlayerAllocationFilter=false&PlayerTracking=false&SDKGracefulTermination=false&StateAllocationFilter=false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s4rzz (ro)
  simple-game-server:
    Image:      gcr.io/agones-images/simple-game-server:0.13
    Port:       7654/UDP
    Host Port:  7258/UDP
    Requests:
      cpu:     0
      memory:  0
    Liveness:  http-get http://:8080/gshealthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      AGONES_SDK_GRPC_PORT:  9357
      AGONES_SDK_HTTP_PORT:  9358
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from empty (ro)
Volumes:
  empty:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-s4rzz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   2m44s  default-scheduler   0/1 nodes are available: 1 Too many pods.
  Warning  FailedScheduling   2m42s  default-scheduler   0/1 nodes are available: 1 Too many pods.
  Normal   Scheduled          43s    default-scheduler   Successfully assigned default/simple-game-server-qxtcq-wbs6p to gke-friday-developme-gameserver-pool--f788a1c2-szxx
  Normal   NotTriggerScaleUp  2m44s  cluster-autoscaler  pod didn't trigger scale-up:
  Warning  OutOfpods          44s    kubelet             Node didn't have enough resource: pods, requested: 1, used: 32, capacity: 32

katsew avatar Jul 28 '22 03:07 katsew

Can you reproduce the issue without actively deleting Kubernetes components?

markmandel avatar Jul 28 '22 16:07 markmandel

Today I tried to reproduce without forcibly deleting kube-proxy, but I couldn't.

First, I deployed a static pod with a spec similar to the kube-proxy Pod and then deleted it forcibly, but that didn't work.

Here is the set of fields I tried to align with the kube-proxy Pod:

  • ownerReferences
  • priorityClassName
  • tolerations
  • nodeName
  • resource.requests.cpu

The actual Pod resource is this:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: kube-system
  labels:
    component: nginx
    tier: node
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: gke-friday-developme-gameserver-pool--f788a1c2-pdj1
    uid: 37d22a61-6e19-4729-bf3e-86a8823c9215
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
  nodeName: gke-friday-developme-gameserver-pool--f788a1c2-pdj1
  priorityClassName: system-node-critical
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    operator: Exists

Next, I tried setting pods to 0 in a ResourceQuota and then forcibly deleting the GameServer, which also did not work.

# deploy to the same namespace that the GameServers are in
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-counts
  namespace: default
spec:
  hard:
    pods: "0"

katsew avatar Jul 29 '22 03:07 katsew

Here is the kubelet log.

Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.860377   14147 kubelet.go:1950] "SyncLoop DELETE" source="api" pods=[kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1]
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.860644   14147 kubelet_pods.go:1520] "Generating pod status" pod="kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.861193   14147 kubelet.go:1668] "Trying to delete pod" pod="kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1" podUID=9c095625-5113-476e-a638-bc8a78a9271b
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.861335   14147 mirror_client.go:125] "Deleting a mirror pod" pod="kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1" podUID=0xc000b4b060
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.872105   14147 config.go:278] "Setting pods for source" source="api"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.873388   14147 kubelet.go:1944] "SyncLoop REMOVE" source="api" pods=[kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1]
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.883964   14147 config.go:278] "Setting pods for source" source="api"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.884289   14147 config.go:383] "Receiving a new pod" pod="default/simple-game-server-psqqb-hnz28"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.887060   14147 volume_manager.go:394] "Waiting for volumes to attach and mount for pod" pod="kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.887324   14147 volume_manager.go:425] "All volumes are attached and mounted for pod" pod="kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.887937   14147 kuberuntime_manager.go:711] "computePodActions got for pod" podActions={KillPod:false CreateSandbox:false SandboxID:2d3ba683c2accef9b8e33e0d44ff5602267cb2b4645f5cce78d4f04ff6c20c2a Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} pod="kube-system/kube-proxy-gke-friday-developme-gameserver-pool--f788a1c2-pdj1"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.888478   14147 kubelet.go:1934] "SyncLoop ADD" source="api" pods=[default/simple-game-server-psqqb-hnz28]
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.888680   14147 topology_manager.go:187] "Topology Admit Handler"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.889513   14147 predicate.go:143] "Predicate failed on Pod" pod="simple-game-server-psqqb-hnz28_default(34520514-3a31-4fc2-a039-5727211e7f4b)" err="Node didn't have enough resource: pods, requested: 1, used: 32, capacity: 32"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.890657   14147 event.go:291] "Event occurred" object="default/simple-game-server-psqqb-hnz28" kind="Pod" apiVersion="v1" type="Warning" reason="OutOfpods" message="Node didn't have enough resource: pods, requested: 1, used: 32, capacity: 32"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.922069   14147 status_manager.go:586] "Patch status for pod" pod="default/simple-game-server-psqqb-hnz28" patchBytes="{\"metadata\":{\"uid\":\"34520514-3a31-4fc2-a039-5727211e7f4b\"},\"status\":{\"conditions\":null,\"message\":\"Pod Node didn't have enough resource: pods, requested: 1, used: 32, capacity: 32\",\"phase\":\"Failed\",\"qosClass\":null,\"reason\":\"OutOfpods\",\"startTime\":\"2022-07-29T01:42:42Z\"}}"
Jul 29 01:42:42 gke-friday-developme-gameserver-pool--f788a1c2-pdj1 kubelet[14147]: I0729 01:42:42.922330   14147 status_manager.go:595] "Status for pod updated successfully" pod="default/simple-game-server-psqqb-hnz28" statusVersion=1 status={Phase:Failed Conditions:[] Message:Pod Node didn't have enough resource: pods, requested: 1, used: 32, capacity: 32 Reason:OutOfpods NominatedNodeName: HostIP: PodIP: PodIPs:[] StartTime:2022-07-29 01:42:42 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[] QOSClass: EphemeralContainerStatuses:[]}

Any ideas on how to reproduce this?

katsew avatar Jul 29 '22 03:07 katsew

I have updated the reproduction steps. I said I deployed a static Pod in kube-system, but I had actually deployed a bare Pod, not a static Pod. Once I created a real static Pod as described in step 1 above, the issue was reproduced without deleting kube-proxy.

katsew avatar Aug 01 '22 07:08 katsew

Maybe a silly question but - why would someone add that manifest to a node? (most people I would expect are on cloud providers and either (a) don't have access or (b) will have it overwritten pretty fast)

markmandel avatar Aug 01 '22 16:08 markmandel

According to the documentation, static Pods are meant for users to deploy their own control-plane components, but I'm not sure whether there are other use cases where users actually rely on them. Just to be clear, I used a static Pod only to reproduce this issue; I do not use one in a real environment.

katsew avatar Aug 02 '22 10:08 katsew

We currently work around this issue by running descheduler to evict Pods that failed with OutOfpods. However, there is some delay, since descheduler runs as a CronJob.

So, I would like to submit a PR that solves this issue, but I don't see how I can turn this reproduction method into a test case... 😢

katsew avatar Aug 02 '22 10:08 katsew

Today I encountered the same behavior as this issue when a Pod failed with OutOfcpu. It seems that when a Pod fails for certain reasons, the GameServer does not recover automatically 🤔

katsew avatar Aug 16 '22 01:08 katsew

@markmandel

I've been working on this issue, and it seems to be caused by the insufficient-resource error in the kubelet admission predicate: https://github.com/kubernetes/kubernetes/blob/v1.21.12/pkg/kubelet/lifecycle/predicate.go#L140

My suggestion for fixing this is to add a condition here that checks whether the Pod failed because of an insufficient-resource error: https://github.com/googleforgames/agones/blob/main/pkg/gameservers/health.go#L105
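
To make the suggestion concrete, here is a minimal sketch of such a check (not the actual Agones implementation; the helper name is hypothetical). It assumes the OutOf<resource> reason strings (OutOfpods, OutOfcpu) that the kubelet admission predicate produces, as seen in this thread:

package gameservers

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isInsufficientResourceFailure is a hypothetical helper: it reports whether
// the backing Pod was rejected by the kubelet for lack of node resources.
// In that case the kubelet sets phase=Failed and a reason such as
// "OutOfpods" or "OutOfcpu", so a prefix check is enough for this sketch.
func isInsufficientResourceFailure(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodFailed &&
		strings.HasPrefix(pod.Status.Reason, "OutOf")
}

The health controller could then treat such a Pod like any other failed backing Pod and move the GameServer to Unhealthy so the Fleet replaces it.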

If it's ok to apply this fix, I'll submit a PR for this.

What do you think?

katsew avatar Aug 16 '22 06:08 katsew

I'll be honest, I'm still not understanding what the actual issue here is. It seems like you have to break Kubernetes to make it happen -- which doesn't sound like an actual bug, it sounds like an extreme edge case.

Also, a Pod in pending state is an indicator to the cluster autoscaler that it should expand the cluster - so changing that behaviour says to me that we should leave this alone.

If you can't replicate this issue without actually messing around with the underlying Kubernetes system, I'm not sure we should be considering a fix here?

markmandel avatar Aug 17 '22 18:08 markmandel

I too am aware that I have hit an edge case rather than found a bug. We currently downsize the entire cluster after work hours and scale it back to its previous size before work hours start. I feel this makes it easier to run into edge cases.

Also, a Pod in pending state is an indicator to the cluster autoscaler that it should expand the cluster - so changing that behaviour says to me that we should leave this alone.

There may be some misunderstanding in this part. The problem is not a Pod in the Pending state, but one in the Failed state. The GameServer does not recreate a backing Pod that is in the Failed state, and the GameServer never transitions out of the Scheduled state, even if the cluster autoscaler scales out nodes.

This can prevent the FleetAutoscaler from working properly, since the GameServer is stuck in the Scheduled state and the FleetAutoscaler cannot scale the Fleet until the failed Pod is deleted manually.

If you can't replicate this issue without actually messing around with the underlying Kubernetes system, I'm not sure we should be considering a fix here?

Maybe I should ask the sig-node community for help to replicate the issue without killing static pods.

katsew avatar Aug 18 '22 02:08 katsew

I don't know if it's worth mentioning, but a Pod created from a Deployment will not stay stuck in this state; a replacement comes up and becomes Running.
So my thought is that it would be nice if Pods created by a GameServer could be recovered in the same way.

katsew avatar Aug 18 '22 03:08 katsew

Gotcha!

So ultimately it sounds like if a Pod's status is Failed, we should handle that general case (it's less about OutOfpods specifically, and more about a Pod being in a Failed state in general).

I wonder if there is an easy way to just create a Failed Pod somehow, and use that as our test case. It does sound like if a Pod has failed for any reason, it should be moved to Unhealthy anyway.

I had a quick look to see if there was an easy way to make that happen though. Did you have any luck with sig-node?
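
For a unit test, at least, a Failed Pod could simply be fabricated rather than produced by a real kubelet. A minimal sketch, assuming client-go's fake clientset (the test and Pod names here are illustrative, not Agones' actual test helpers):

package gameservers_test

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestFabricatedFailedPod(t *testing.T) {
	// Fabricate a Pod that is already in the Failed phase with the
	// OutOfpods reason seen in this issue.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "simple-game-server-test", Namespace: "default"},
		Status: corev1.PodStatus{
			Phase:  corev1.PodFailed,
			Reason: "OutOfpods",
		},
	}

	client := fake.NewSimpleClientset(pod)

	got, err := client.CoreV1().Pods("default").Get(context.Background(), pod.Name, metav1.GetOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if got.Status.Phase != corev1.PodFailed {
		t.Fatalf("expected phase Failed, got %s", got.Status.Phase)
	}
}

An end-to-end replication would still need something like the restartPolicy: Never manifest that appears later in this thread.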

markmandel avatar Aug 24 '22 20:08 markmandel

Thank you for straightening out the issue.

I wonder if there is an easy way to just create a Failed Pod somehow, and use that as our test case. It does sound like if a Pod has failed for any reason, it should be moved to Unhealthy anyway. I had a quick look to see if there was an easy way to make that happen though. Did you have any luck with sig-node?

Sorry, I've been too busy to ask the community for help yet, but I will do so soon.

katsew avatar Aug 25 '22 06:08 katsew

@markmandel

Sorry it took so long to ask the question. I posted it to the k8s community Slack channel, but couldn't get an answer. I've also tried to create a Failed Pod, but I have no idea how to get a Pod's status.phase to Failed.

Have you found any way to create a Failed Pod?

katsew avatar Oct 04 '22 02:10 katsew

Looking at: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/

Failed: All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.

If this happens (assuming there is a Fleet in use), this will kick the GameServer into Unhealthy , and it will then get replaced with a new Pod.

...also, if we can't replicate the issue, is it an issue? 😄

markmandel avatar Oct 14 '22 19:10 markmandel

@markmandel

I found a manifest to replicate the issue. We have to set restartPolicy to Never and then exit the containers with a non-zero status. To get all containers to exit with a non-zero status, I had to add hostPID: true so that all container process IDs are visible and the sidecar can be killed as well. I also found that exiting the containers with a zero status leaves the Pod stuck in phase Succeeded. Should we handle that case, too? 🤔

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server
  namespace: default
spec:
  replicas: 1
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          hostPID: true
          restartPolicy: Never
          containers:
          - name: simple-game-server
            image: gcr.io/agones-images/simple-game-server:0.13
            command:
              - sh
              - -cx
              - |
                pgrep -n sdk-server | xargs kill -9
                exit 1

Related issue: https://github.com/googleforgames/agones/issues/2361

katsew avatar Oct 17 '22 05:10 katsew

I would like to chime in here as it seems like it's the same issue.

There is a relatively new Kubernetes feature - https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown - and it seems it can lead to Pods being transitioned into the Failed state too:

Status:           Failed
Reason:           Terminated
Message:          Pod was terminated in response to imminent node shutdown.

I think the controlling GameServer should indeed be moved to Unhealthy here as the correct reaction.

I unfortunately don't have a concrete way to reproduce it, as we encountered the issue in production on a loaded cluster. But my guess is that this happens if a node's shutdownGracePeriod and shutdownGracePeriodCriticalPods are non-zero (so the feature is enabled) but not long enough to actually terminate the containers inside the Pod gracefully, because the containers have a larger terminationGracePeriodSeconds and actually use it up.

unlightable avatar Dec 29 '22 16:12 unlightable

This is all kinds of fun, because of Pod restarts 😄 so I appreciate people digging in.

One thing I'm trying to work out from the documentation is whether, once a Pod is in a Failed state, that is its final state.

I.e. do we know if a Pod could restart its way out of a Failed state? @roberthbailey @zmerlynn @gongmax @igooch any of you know? The docs don't seem clear to me.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase

markmandel avatar Jan 03 '23 22:01 markmandel

status.phase is a synthetic view really meant for humans. As pointed out in the link: The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or Pod state, nor is it intended to be a comprehensive state machine. Really, you can think of status.phase as aggregating the state of each container and potentially other Pod conditions.

That said, with restartPolicy: Never, I would expect Failed to be terminal (except for possibly some nuance around the state of the SDK sidecar container). It would be useful to have the full kubectl get -oyaml for the Pod in question rather than describe view, just to see.

zmerlynn avatar Jan 03 '23 23:01 zmerlynn

Since we run (by default) with restartPolicy: Always on the Pod, we have to assume there is a restart.

@unlightable in your situation, I assume once the node was torn down, the GameServer was replaced on another node?

markmandel avatar Jan 03 '23 23:01 markmandel