kuberhealthy
deployment pods are left in Terminating state
Describe the bug
From time to time there will be pods stuck on terminating:

k -n kuberhealthy get pods
NAME                                     READY   STATUS        RESTARTS   AGE
deployment-deployment-7c775dfbdd-6csn5   0/1     Terminating   0          5d17h

It's always the deployment one, never the daemonset.
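As a temporary workaround while debugging, a pod stuck in Terminating can usually be inspected and, if nothing substantive is holding it, force-removed. A sketch using the pod name from the listing above:

```shell
# Inspect the stuck pod for events and finalizers first
kubectl -n kuberhealthy describe pod deployment-deployment-7c775dfbdd-6csn5

# Last resort: skip the grace period and delete the API object immediately
kubectl -n kuberhealthy delete pod deployment-deployment-7c775dfbdd-6csn5 \
  --grace-period=0 --force
```

Note that --force only removes the object from the API server; it does not guarantee the container has actually stopped on the node.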
Steps To Reproduce
- Run kuberhealthy
Expected behavior
All test pods should be removed properly.

Screenshots
N/A
Versions
- OCP 4.7
- Kubernetes Version: 1.20
- Kuberhealthy Release or build [e.g. 0.1.5 or 235]
Additional context
Chart install params:
spec:
  chart:
    spec:
      chart: kuberhealthy
      sourceRef:
        kind: HelmRepository
        name: kuberhealthy
        namespace: flux-system
      version: "77"
  interval: 10m
  values:
    imageRegistry: docker-nexus.finods.com/kuberhealthy
    podDisruptionBudget:
      enabled: false
    prometheus:
      enabled: true
      serviceMonitor:
        enabled: true
    resources:
      requests:
        cpu: 10m
Maybe related? https://github.com/kuberhealthy/kuberhealthy/issues/254
Thanks for sharing this!
Could you please post logs for the deployment pod and the events for the deployment-deployment pod?
What version of the deployment-check are you using?
> Could you please post logs for the deployment pod and the events for the deployment-deployment pod?

I'll wait for one to hang. Just cleared them out today.

> What version of the deployment-check are you using?

v1.9.0
here's one
Priority: 0
Node: alt-ksx-r-c01oco03/139.114.216.183
Start Time: Fri, 20 Aug 2021 19:20:02 +0200
Labels: deployment-timestamp=unix-1629479963
pod-template-hash=56ff574674
source=kuberhealthy
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.200.6.98"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.200.6.98"
],
"default": true,
"dns": {}
}]
openshift.io/scc: anyuid
Status: Terminating (lasts 5d4h)
Termination Grace Period: 1s
IP:
IPs: <none>
Controlled By: ReplicaSet/deployment-deployment-56ff574674
Containers:
deployment-container:
Container ID:
Image: nginxinc/nginx-unprivileged:1.17.9
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 75m
memory: 75Mi
Requests:
cpu: 15m
memory: 20Mi
Liveness: tcp-socket :8080 delay=2s timeout=2s period=15s #success=1 #failure=5
Readiness: tcp-socket :8080 delay=2s timeout=2s period=15s #success=1 #failure=5
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m5jf4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-m5jf4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m5jf4
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
k logs deployment-deployment-56ff574674-z49b5
Error from server (BadRequest): container "deployment-container" in pod "deployment-deployment-56ff574674-z49b5" is waiting to start: ContainerCreating
Did you happen to get the logs from the deployment-xxxxxx pod?
See above - second section. That won't be output due to the status.
I mean the other pod -- there is a pod named deployment-xxxxx that is created by the kuberhealthy master. The deployment-xxxxx pod is the one that handles creating this deployment-deployment-xxxxx pod. It's a little confusing, but the pod that you have posted is part of the nginx deployment created by the deployment-xxxxx pod -- there might be more information in the logs there.
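To grab the right logs, the two pods can be told apart by name (a sketch; deployment-xxxxx stands in for the generated suffix, matching the placeholder above):

```shell
# The checker pod is deployment-<suffix>; the nginx pod it creates is
# deployment-deployment-<suffix>.
kubectl -n kuberhealthy get pods

# Logs from the checker pod (placeholder name)
kubectl -n kuberhealthy logs deployment-xxxxx
```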
Anyways, this should be enough to start investigating -- thanks!
Nothing too interesting, maybe this?
level=info msg="Could not delete service: deployment-svc"
time="2021-09-03T15:15:45Z" level=info msg="Found an old deployment belonging to this check: deployment-deployment"
time="2021-09-03T15:15:45Z" level=info msg="Found previous deployment."
@jonnydawg do you need more info? What could be the cause of this?
@davidkarlsen I have not been able to reproduce this issue yet. Currently working on getting a test openshift cluster to test on.
@davidkarlsen Are you running any custom validating or mutating webhooks in this cluster?
@jonnydawg
k get mutatingwebhookconfigurations,validatingwebhookconfigurations
NAME WEBHOOKS AGE
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook 1 341d
mutatingwebhookconfiguration.admissionregistration.k8s.io/inmemorychannel.eventing.knative.dev 1 61d
mutatingwebhookconfiguration.admissionregistration.k8s.io/machine-api 2 348d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating-knative-openshift 2 313d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating.knativeeventings.operator.serverless.openshift.iozgvc6 1 27d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating.knativekafkas.operator.serverless.openshift.io-xq4wf 1 27d
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating.knativeservings.operator.serverless.openshift.io-7sj5b 1 27d
mutatingwebhookconfiguration.admissionregistration.k8s.io/sinkbindings.webhook.sources.knative.dev 1 313d
mutatingwebhookconfiguration.admissionregistration.k8s.io/vault-agent-injector-cfg 1 279d
mutatingwebhookconfiguration.admissionregistration.k8s.io/webhook.domainmapping.serving.knative.dev 1 272d
mutatingwebhookconfiguration.admissionregistration.k8s.io/webhook.eventing.knative.dev 1 313d
mutatingwebhookconfiguration.admissionregistration.k8s.io/webhook.serving.knative.dev 1 313d
NAME WEBHOOKS AGE
validatingwebhookconfiguration.admissionregistration.k8s.io/autoscaling.openshift.io 2 348d
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook 1 341d
validatingwebhookconfiguration.admissionregistration.k8s.io/cluster-baremetal-validating-webhook-configuration 1 60d
validatingwebhookconfiguration.admissionregistration.k8s.io/config.webhook.eventing.knative.dev 1 313d
validatingwebhookconfiguration.admissionregistration.k8s.io/config.webhook.serving.knative.dev 1 313d
validatingwebhookconfiguration.admissionregistration.k8s.io/machine-api 2 348d
validatingwebhookconfiguration.admissionregistration.k8s.io/multus.openshift.io 1 348d
validatingwebhookconfiguration.admissionregistration.k8s.io/prometheusrules.openshift.io 1 348d
validatingwebhookconfiguration.admissionregistration.k8s.io/snapshot.storage.k8s.io 1 210d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating-knative-openshift 3 313d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating-webhook-configuration 10 40d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating.knativeeventings.operator.serverless.openshift.9dlw7 1 27d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating.knativekafkas.operator.serverless.openshift.io-mxfkc 1 27d
validatingwebhookconfiguration.admissionregistration.k8s.io/validating.knativeservings.operator.serverless.openshift.ibrvqx 1 27d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.inmemorychannel.eventing.knative.dev 1 61d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.domainmapping.serving.knative.dev 1 272d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.eventing.knative.dev 1 313d
validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.serving.knative.dev 1 313d
Does your cluster-baremetal-validating-webhook-configuration validating webhook have any logs regarding these objects?
When this error occurs, is there a hanging deployment / replica set object that is still living?
- If this is the case -- could you post the events / details on those objects?
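When it happens again, something along these lines should show whether a leftover object or a finalizer is pinning the pod (a sketch; the pod name is the earlier example):

```shell
# Any leftover deployment / replicaset still owning the pod?
kubectl -n kuberhealthy get deployments,replicasets

# Finalizers on the stuck pod (an empty result means none)
kubectl -n kuberhealthy get pod deployment-deployment-7c775dfbdd-6csn5 \
  -o jsonpath='{.metadata.finalizers}'

# Events referencing the stuck pod
kubectl -n kuberhealthy get events \
  --field-selector involvedObject.name=deployment-deployment-7c775dfbdd-6csn5
```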
I have bumped the termination grace period up to 15s from the original 1s -- this should give k8s more time to process teardown of the check's deployments. Please try using the newer image kuberhealthy/deployment-check:issue1008
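For reference, the setting being changed corresponds to this field in the check's pod template (an illustrative fragment, not the full spec; the describe output above shows the old 1s value):

```yaml
spec:
  # Was 1 before this change; 1s gave the kubelet almost no time to
  # finish teardown before the deletion deadline.
  terminationGracePeriodSeconds: 15
```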
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment on the issue or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity. Please reopen and comment on the issue if you believe it should stay open.