draino
pods left in "unknown" state
We run draino with the following condition, among others:
- Ready=Unknown,3m #Drain node if unavailable for 3 minutes
I simulated a failure by stopping/disabling the kubelet on a node. This causes Draino to cordon the node as expected, but we're noticing that pods are left in an "unknown" state on the node since the kubelet is gone:
╰─ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
rikstv-api-recommendation-v1-uat-main-6d9b46bf-9gj44   4/4     Running   1          20m
rikstv-api-recommendation-v1-uat-main-6d9b46bf-k89hz   4/4     Unknown   0          149m
rikstv-api-recommendation-v1-uat-main-6d9b46bf-t84ns   4/4     Running   1          20m
rikstv-api-recommendation-v1-uat-main-6d9b46bf-vdj2g   4/4     Unknown   0          149m
This in turn seems to trick cluster-autoscaler into not removing the failed node. I'm not sure draino should be responsible for cleaning up pod objects on crashed nodes, but I thought I'd ask here anyway, since it's probably a typical situation draino users can get into.
Does Draino have (or is anything planned) functionality for cleaning up failed nodes so that cluster-autoscaler can delete them cleanly?
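For context, the manual cleanup this would replace is roughly a force delete of each orphaned pod object, e.g. (pod name taken from the listing above):
kubectl delete pod rikstv-api-recommendation-v1-uat-main-6d9b46bf-k89hz --grace-period=0 --force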
Is there any chance you could reproduce this and dump the YAML (i.e. kubectl get -o yaml) of one of the pods in an Unknown state? I'm wondering if there's a finalizer or something keeping them around.
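For example (substituting one of the pod names from the listing above):
kubectl get pod rikstv-api-recommendation-v1-uat-main-6d9b46bf-k89hz -o yaml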
We don't have anything planned around this, but I do think it's a use case we should address if there's a clean way to do so.
Nice! I think this just happens when the kubelet is stopped. Replacement pods are spun up on new nodes, but the "orphaned" pods are still stuck. We're doing some testing around this in the coming days and will update this issue with an example pod when I get a chance.
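In the meantime, a quick way to check just for finalizers on one of the stuck pods (pod name again taken from the listing above):
kubectl get pod rikstv-api-recommendation-v1-uat-main-6d9b46bf-k89hz -o jsonpath='{.metadata.finalizers}'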
Please help me out, because I am not able to make Draino work. Even when a node is in the Unknown state, it doesn't drain it. I used kops to spin up a cluster with 1 master and 2 nodes. I push a node into the Unknown state by starting this script on it:
#!/bin/bash
# Infinite loop that keeps spawning run.sh in the background,
# loading the node until it stops responding.
for (( ; ; ))
do
    echo "Press CTRL+C to stop............................................."
    nohup ./run.sh &
done
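(For reference, the repro at the top of this thread was stopping the kubelet directly; on a systemd-managed node that would be roughly the command below, which may be a quicker way to push a node into Ready=Unknown than overloading it.)
sudo systemctl stop kubelet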
My draino.yaml is this:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels: {component: draino}
  name: draino
rules:
- apiGroups: ['']
  resources: [events]
  verbs: [create, patch, update]
- apiGroups: ['']
  resources: [nodes]
  verbs: [get, watch, list, update]
- apiGroups: ['']
  resources: [nodes/status]
  verbs: [patch]
- apiGroups: ['']
  resources: [pods]
  verbs: [get, watch, list]
- apiGroups: ['']
  resources: [pods/eviction]
  verbs: [create]
- apiGroups: [extensions]
  resources: [daemonsets]
  verbs: [get, watch, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels: {component: draino}
  name: draino
roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
subjects:
- {kind: ServiceAccount, name: draino, namespace: kube-system}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels: {component: draino}
  template:
    metadata:
      labels: {component: draino}
      name: draino
      namespace: kube-system
    spec:
      containers:
      - name: draino
        image: planetlabs/draino:5e07e93
        command:
        - /draino
        - --debug
        - Ready=Unknown
        livenessProbe:
          httpGet: {path: /healthz, port: 10002}
          initialDelaySeconds: 30
      serviceAccountName: draino
@prabhatnagpal I suggest raising a separate GitHub issue for your problems, and including any logs and metrics Draino emits with your issue so we can try to help you out.
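One thing that might be worth comparing, though I'm not sure it's related to your problem: the condition used at the top of this thread includes a duration (Ready=Unknown,3m), so the container args there would look roughly like:
- /draino
- --debug
- Ready=Unknown,3m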