
deployment pods are left in Terminating state

Open davidkarlsen opened this issue 4 years ago • 13 comments

Describe the bug
From time to time there will be pods stuck in Terminating:

k -n kuberhealthy get pods
NAME                                     READY   STATUS        RESTARTS   AGE
deployment-deployment-7c775dfbdd-6csn5   0/1     Terminating   0          5d17h

It's always the deployment check pod, never the daemonset one.
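Not part of the original report, but the symptom is easy to flag programmatically: a Terminating pod has `metadata.deletionTimestamp` set, so anything whose deletion deadline passed long ago is likely stuck. A minimal Python sketch over `kubectl get pods -o json` output (the field names are standard Kubernetes pod metadata; the 10-minute threshold is an arbitrary choice for illustration):

```python
from datetime import datetime, timedelta, timezone

def stuck_terminating(pod_list, threshold_minutes=10, now=None):
    """Return names of pods whose deletion deadline passed more than
    `threshold_minutes` ago. A Terminating pod has metadata.deletionTimestamp
    set; once that time plus the grace period is far in the past, the pod
    is likely stuck."""
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pod_list.get("items", []):
        ts = pod["metadata"].get("deletionTimestamp")
        if ts is None:
            continue  # pod is not being deleted
        deleted_at = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        grace = pod.get("spec", {}).get("terminationGracePeriodSeconds", 30)
        deadline = deleted_at + timedelta(seconds=grace)
        if now - deadline > timedelta(minutes=threshold_minutes):
            stuck.append(pod["metadata"]["name"])
    return stuck

# Trimmed sample; in practice, feed it `kubectl -n kuberhealthy get pods -o json`.
sample = {"items": [
    {"metadata": {"name": "deployment-deployment-7c775dfbdd-6csn5",
                  "deletionTimestamp": "2021-08-18T17:20:05Z"},
     "spec": {"terminationGracePeriodSeconds": 1}},
    {"metadata": {"name": "daemonset-abc12"}, "spec": {}},
]}
print(stuck_terminating(sample))
```

Against this sample, only the deployment pod is reported, since the daemonset pod has no `deletionTimestamp`.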

Steps To Reproduce

  • Run kuberhealthy

Expected behavior
All test pods should be removed properly.

Screenshots N/A

Versions

  • OCP 4.7
  • Kubernetes Version: 1.20
  • Kuberhealthy Release or build [e.g. 0.1.5 or 235]

Additional context
Chart install params:

spec:
    chart:
      spec:
        chart: kuberhealthy
        sourceRef:
          kind: HelmRepository
          name: kuberhealthy
          namespace: flux-system
        version: "77"
    interval: 10m
    values:
      imageRegistry: docker-nexus.finods.com/kuberhealthy
      podDisruptionBudget:
        enabled: false
      prometheus:
        enabled: true
        serviceMonitor:
          enabled: true
      resources:
        requests:
          cpu: 10m

davidkarlsen avatar Aug 24 '21 11:08 davidkarlsen

Maybe related? https://github.com/kuberhealthy/kuberhealthy/issues/254

davidkarlsen avatar Aug 24 '21 11:08 davidkarlsen

Thanks for sharing this!

Could you please post logs for the deployment pod and the events for the deployment-deployment pod? What version of the deployment-check are you using?

jonnydawg avatar Aug 24 '21 17:08 jonnydawg

Could you please post logs for the deployment pod and the events for the deployment-deployment pod?

I'll wait for one to hang. Just cleared them out today.

What version of the deployment-check are you using?

v1.9.0

davidkarlsen avatar Aug 25 '21 17:08 davidkarlsen

here's one

Priority:                  0
Node:                      alt-ksx-r-c01oco03/139.114.216.183
Start Time:                Fri, 20 Aug 2021 19:20:02 +0200
Labels:                    deployment-timestamp=unix-1629479963
                           pod-template-hash=56ff574674
                           source=kuberhealthy
Annotations:               k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "",
                                 "interface": "eth0",
                                 "ips": [
                                     "10.200.6.98"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "",
                                 "interface": "eth0",
                                 "ips": [
                                     "10.200.6.98"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           openshift.io/scc: anyuid
Status:                    Terminating (lasts 5d4h)
Termination Grace Period:  1s
IP:                        
IPs:                       <none>
Controlled By:             ReplicaSet/deployment-deployment-56ff574674
Containers:
  deployment-container:
    Container ID:   
    Image:          nginxinc/nginx-unprivileged:1.17.9
    Image ID:       
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     75m
      memory:  75Mi
    Requests:
      cpu:        15m
      memory:     20Mi
    Liveness:     tcp-socket :8080 delay=2s timeout=2s period=15s #success=1 #failure=5
    Readiness:    tcp-socket :8080 delay=2s timeout=2s period=15s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m5jf4 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-m5jf4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m5jf4
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

k logs deployment-deployment-56ff574674-z49b5
Error from server (BadRequest): container "deployment-container" in pod "deployment-deployment-56ff574674-z49b5" is waiting to start: ContainerCreating

davidkarlsen avatar Aug 25 '21 21:08 davidkarlsen

Did you happen to get the logs from the deployment-xxxxxx pod?

jonnydawg avatar Aug 27 '21 19:08 jonnydawg

See above, second section. The logs can't be retrieved because the container never started (it's stuck in ContainerCreating).

davidkarlsen avatar Aug 27 '21 19:08 davidkarlsen

I mean the other pod -- there is a pod named deployment-xxxxx that is created by the kuberhealthy master. The deployment-xxxxx pod is the one that creates this deployment-deployment-xxxxx pod. It's a little confusing, but the pod you posted is part of the nginx deployment created by deployment-xxxxx -- there might be more information in that checker pod's logs.

Anyways, this should be enough to start investigating -- thanks!

jonnydawg avatar Aug 27 '21 19:08 jonnydawg

Nothing too interesting, maybe this?



level=info msg="Could not delete service: deployment-svc"
time="2021-09-03T15:15:45Z" level=info msg="Found an old deployment belonging to this check: deployment-deployment"
time="2021-09-03T15:15:45Z" level=info msg="Found previous deployment."

davidkarlsen avatar Sep 09 '21 21:09 davidkarlsen

@jonnydawg do you need more info? What can be the cause of this?

davidkarlsen avatar Oct 18 '21 00:10 davidkarlsen

@davidkarlsen I have not been able to reproduce this issue yet. Currently working on getting a test openshift cluster to test on.

jonnydawg avatar Oct 19 '21 18:10 jonnydawg

@davidkarlsen Are you running any custom validating or mutating webhooks in this cluster?

jonnydawg avatar Nov 16 '21 21:11 jonnydawg

@jonnydawg

k get mutatingwebhookconfigurations,validatingwebhookconfigurations
NAME                                                                                                                        WEBHOOKS   AGE
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook                                              1          341d
mutatingwebhookconfiguration.admissionregistration.k8s.io/inmemorychannel.eventing.knative.dev                              1          61d
mutatingwebhookconfiguration.admissionregistration.k8s.io/machine-api                                                       2          348d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating-knative-openshift                                        2          313d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating.knativeeventings.operator.serverless.openshift.iozgvc6   1          27d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating.knativekafkas.operator.serverless.openshift.io-xq4wf     1          27d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating.knativeservings.operator.serverless.openshift.io-7sj5b   1          27d
mutatingwebhookconfiguration.admissionregistration.k8s.io/sinkbindings.webhook.sources.knative.dev                          1          313d
mutatingwebhookconfiguration.admissionregistration.k8s.io/vault-agent-injector-cfg                                          1          279d
mutatingwebhookconfiguration.admissionregistration.k8s.io/webhook.domainmapping.serving.knative.dev                         1          272d
mutatingwebhookconfiguration.admissionregistration.k8s.io/webhook.eventing.knative.dev                                      1          313d
mutatingwebhookconfiguration.admissionregistration.k8s.io/webhook.serving.knative.dev                                       1          313d

NAME                                                                                                                          WEBHOOKS   AGE
validatingwebhookconfiguration.admissionregistration.k8s.io/autoscaling.openshift.io                                          2          348d
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook                                              1          341d
validatingwebhookconfiguration.admissionregistration.k8s.io/cluster-baremetal-validating-webhook-configuration                1          60d
validatingwebhookconfiguration.admissionregistration.k8s.io/config.webhook.eventing.knative.dev                               1          313d
validatingwebhookconfiguration.admissionregistration.k8s.io/config.webhook.serving.knative.dev                                1          313d
validatingwebhookconfiguration.admissionregistration.k8s.io/machine-api                                                       2          348d
validatingwebhookconfiguration.admissionregistration.k8s.io/multus.openshift.io                                               1          348d
validatingwebhookconfiguration.admissionregistration.k8s.io/prometheusrules.openshift.io                                      1          348d
validatingwebhookconfiguration.admissionregistration.k8s.io/snapshot.storage.k8s.io                                           1          210d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating-knative-openshift                                      3          313d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating-webhook-configuration                                  10         40d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating.knativeeventings.operator.serverless.openshift.9dlw7   1          27d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating.knativekafkas.operator.serverless.openshift.io-mxfkc   1          27d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating.knativeservings.operator.serverless.openshift.ibrvqx   1          27d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.inmemorychannel.eventing.knative.dev                   1          61d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.domainmapping.serving.knative.dev              1          272d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.eventing.knative.dev                           1          313d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.serving.knative.dev                            1          313d

davidkarlsen avatar Nov 17 '21 15:11 davidkarlsen

Does your cluster-baremetal-validating-webhook-configuration validating webhook have any logs regarding these objects?

When this error occurs, is there a hanging deployment / replica set object that is still living?

  • If this is the case -- could you post the events / details on those objects?

I have bumped the termination grace period up to 15s from the original 1s -- this should give Kubernetes more time to process pod termination when tearing down the check deployment. Please try using the newer image kuberhealthy/deployment-check:issue1008
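For reference, the knob being changed is the standard pod-spec field `terminationGracePeriodSeconds`. A hedged sketch of where it sits in the deployment the check creates (surrounding fields trimmed and illustrative, not the check's actual manifest):

```yaml
# Sketch only: everything except terminationGracePeriodSeconds is trimmed/illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 15   # was 1, which raced pod teardown
      containers:
        - name: deployment-container
          image: nginxinc/nginx-unprivileged:1.17.9
```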

jonnydawg avatar Dec 01 '21 20:12 jonnydawg

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment on the issue or this will be closed in 15 days.

github-actions[bot] avatar Jan 27 '24 00:01 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity. Please reopen and comment on the issue if you believe it should stay open.

github-actions[bot] avatar Feb 19 '24 00:02 github-actions[bot]