draino
No actions being taken by draino for the condition Ready=Unknown
Please help me out, because I am not able to make Draino work: even when a node is in the Unknown state, it does not get drained. I used kops to spin up a cluster with 1 master and 2 nodes, and I push a node into the Unknown state by running this script on it:
#!/bin/bash
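# Repeatedly launch ./run.sh in the background without waiting, to overload
# the node until the kubelet stops posting status.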
for (( ; ; ))
do
echo "Press CTRL+C to stop............................................."
nohup ./run.sh &
done
My draino.yaml is this:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels: {component: draino}
  name: draino
rules:
- apiGroups: ['']
  resources: [events]
  verbs: [create, patch, update]
- apiGroups: ['']
  resources: [nodes]
  verbs: [get, watch, list, update]
- apiGroups: ['']
  resources: [nodes/status]
  verbs: [patch]
- apiGroups: ['']
  resources: [pods]
  verbs: [get, watch, list]
- apiGroups: ['']
  resources: [pods/eviction]
  verbs: [create]
- apiGroups: [extensions]
  resources: [daemonsets]
  verbs: [get, watch, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels: {component: draino}
  name: draino
roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
subjects:
- {kind: ServiceAccount, name: draino, namespace: kube-system}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels: {component: draino}
  template:
    metadata:
      labels: {component: draino}
      name: draino
      namespace: kube-system
    spec:
      containers:
      - name: draino
        image: planetlabs/draino:5e07e93
        command:
        - /draino
        - Ready=Unknown
        livenessProbe:
          httpGet: {path: /healthz, port: 10002}
          initialDelaySeconds: 30
      serviceAccountName: draino
My node-problem-detector.yaml is this:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    app: node-problem-detector
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        command:
        - /node-problem-detector
        - --logtostderr
        - --system-log-monitors=/config/kernel-monitor.json,/config/docker-monitor.json
        image: k8s.gcr.io/node-problem-detector:v0.6.3
        resources:
          limits:
            cpu: 10m
            memory: 80Mi
          requests:
            cpu: 10m
            memory: 80Mi
        imagePullPolicy: Always
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
        # Make sure node problem detector is in the same timezone
        # with the host.
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
        - name: config
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        # Config `log` to your system log directory
        hostPath:
          path: /var/log/
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      - name: localtime
        hostPath:
          path: /etc/localtime
      - name: config
        configMap:
          name: node-problem-detector-config
          items:
          - key: kernel-monitor.json
            path: kernel-monitor.json
          - key: docker-monitor.json
            path: docker-monitor.json
And finally, my node-problem-detector-config.yaml is this:
apiVersion: v1
data:
  kernel-monitor.json: |
    {
      "plugin": "kmsg",
      "logPath": "/dev/kmsg",
      "lookback": "5m",
      "bufferSize": 10,
      "source": "kernel-monitor",
      "conditions": [
        {
          "type": "KernelDeadlock",
          "reason": "KernelHasNoDeadlock",
          "message": "kernel has no deadlock"
        },
        {
          "type": "ReadonlyFilesystem",
          "reason": "FilesystemIsReadOnly",
          "message": "Filesystem is read-only"
        },
        {
          "type": "Ready",
          "reason": "NodeStatusUnknown",
          "message": "Kubelet stopped posting node status"
        }
      ],
      "rules": [
        {
          "type": "temporary",
          "reason": "NodeStatusUnknown",
          "pattern": "Kubelet stopped posting node status"
        },
        {
          "type": "temporary",
          "reason": "OOMKilling",
          "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
        },
        {
          "type": "temporary",
          "reason": "TaskHung",
          "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
        },
        {
          "type": "temporary",
          "reason": "UnregisterNetDevice",
          "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
        },
        {
          "type": "temporary",
          "reason": "KernelOops",
          "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
        },
        {
          "type": "temporary",
          "reason": "KernelOops",
          "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
        },
        {
          "type": "permanent",
          "condition": "KernelDeadlock",
          "reason": "AUFSUmountHung",
          "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
        },
        {
          "type": "permanent",
          "condition": "KernelDeadlock",
          "reason": "DockerHung",
          "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
        },
        {
          "type": "permanent",
          "condition": "ReadonlyFilesystem",
          "reason": "FilesystemIsReadOnly",
          "pattern": "Remounting filesystem read-only"
        }
      ]
    }
  docker-monitor.json: |
    {
      "plugin": "journald",
      "pluginConfig": {
        "source": "dockerd"
      },
      "logPath": "/var/log/journal",
      "lookback": "5m",
      "bufferSize": 10,
      "source": "docker-monitor",
      "conditions": [],
      "rules": [
        {
          "type": "temporary",
          "reason": "CorruptDockerImage",
          "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*"
        }
      ]
    }
kind: ConfigMap
metadata:
  name: node-problem-detector-config
  namespace: kube-system
I came across something similar. Make sure to upgrade your draino image and the Helm chart to the newer version.
Also, the CLI argument you have supplied, Ready=Unknown, is incorrect for the newer version; it needs a duration, e.g. Ready=Unknown,10m. See https://github.com/planetlabs/draino/issues/33 for more info.
Note: I'm running release b788331 and this is working as expected.
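For reference, here is a minimal sketch of how the container command from the original Deployment would look with the duration added. The 10m value is only an example threshold, and the image tag is a placeholder for a release that supports the <condition>,<duration> syntax:

      containers:
      - name: draino
        image: planetlabs/draino:<newer-tag>   # placeholder: use a release with duration support
        command:
        - /draino
        - Ready=Unknown,10m   # act only after Ready has been Unknown for at least 10 minutes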
@dliao-tyro Did you have a complete draino success (cordon + drain) with such a configuration, Ready=Unknown,10m?
https://github.com/planetlabs/draino/issues/33#issuecomment-602466215
I ran the following experiment to demonstrate the behaviour: I stopped the kubelet service on one of the nodes, after which the node reports back a NotReady status with the following:
Name:               ip-12-18-92-92.ap-southeast-2.compute.internal
Roles:              workload
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ap-southeast-2
                    failure-domain.beta.kubernetes.io/zone=ap-southeast-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-12-18-92-92.ap-southeast-2.compute.internal
                    kubernetes.io/lifecycle=spot
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/workload=true
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 23 Mar 2020 17:28:25 +1100
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:   ip-12-18-92-92.ap-southeast-2.compute.internal
  AcquireTime:      <unset>
  RenewTime:        Tue, 24 Mar 2020 08:17:20 +1100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
KernelDeadlock False Tue, 24 Mar 2020 08:26:05 +1100 Mon, 23 Mar 2020 18:42:05 +1100 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Tue, 24 Mar 2020 08:26:05 +1100 Mon, 23 Mar 2020 18:42:05 +1100 FilesystemIsNotReadOnly Filesystem is not read-only
CannotKillContainer False Tue, 24 Mar 2020 08:26:05 +1100 Mon, 23 Mar 2020 18:42:05 +1100 NoCannotKillContainer System can stop containers
FrequentKubeletRestart False Tue, 24 Mar 2020 08:26:05 +1100 Mon, 23 Mar 2020 18:42:07 +1100 FrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Tue, 24 Mar 2020 08:26:05 +1100 Mon, 23 Mar 2020 18:42:09 +1100 FrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Tue, 24 Mar 2020 08:26:05 +1100 Mon, 23 Mar 2020 18:42:11 +1100 FrequentContainerdRestart containerd is functioning properly
FrequentAwslogsdRestart False Tue, 24 Mar 2020 08:26:05 +1100 Mon, 23 Mar 2020 18:42:13 +1100 FrequentAwslogsdRestart awslogsd is functioning properly
MemoryPressure Unknown Tue, 24 Mar 2020 08:17:06 +1100 Tue, 24 Mar 2020 08:18:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Tue, 24 Mar 2020 08:17:06 +1100 Tue, 24 Mar 2020 08:18:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Tue, 24 Mar 2020 08:17:06 +1100 Tue, 24 Mar 2020 08:18:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Tue, 24 Mar 2020 08:17:06 +1100 Tue, 24 Mar 2020 08:18:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
Ten minutes later, the draino logs output:
{"level":"info","ts":1584998887.2430277,"caller":"kubernetes/eventhandler.go:139","msg":"Cordoned","node":"ip-12-18-92-92.ap-southeast-2.compute.internal"}
{"level":"info","ts":1584998887.2431808,"caller":"kubernetes/eventhandler.go:148","msg":"Scheduled drain","node":"ip-12-18-92-92.ap-southeast-2.compute.internal","after":1584999148.9898398}
I then went back and started the kubelet service on the same node, and draino logged the following:
{"level":"info","ts":1584998567.4310849,"caller":"draino/draino.go:172","msg":"node watcher is running"}
{"level":"info","ts":1584998887.2430277,"caller":"kubernetes/eventhandler.go:139","msg":"Cordoned","node":"ip-12-18-92-92.ap-southeast-2.compute.internal"}
{"level":"info","ts":1584998887.2431808,"caller":"kubernetes/eventhandler.go:148","msg":"Scheduled drain","node":"ip-10-18-92-92.ap-southeast-2.compute.internal","after":1584999148.9898398}
{"level":"info","ts":1584999250.2038314,"caller":"kubernetes/eventhandler.go:161","msg":"Drained","node":"ip-12-18-92-92.ap-southeast-2.compute.internal"}
So overall, this does seem to be working as expected with the following configuration in the container spec.
- command:
  - /draino
  - CannotKillContainer
  - DiskPressure
  - FrequentContainerdRestart
  - FrequentDockerRestart
  - FrequentKubeletRestart
  - KernelDeadlock
  - NetworkUnavailable
  - OutOfDisk
  - PIDPressure
  - ReadonlyFilesystem
  - Ready=Unknown,10m
Let me know if you need more information.
@dliao-tyro OK, but in order to have the drain complete, you had to restart the kubelet, which makes the node responsive again (a transition from Unknown (aka NotReady) back to the Ready state).
I was expecting the solution to work on a node that remains NotReady. IMO it can't, due to the way the eviction API works: it waits on the kubelet's status to confirm pod eviction.
@dbenque The example above was just to demonstrate that draino would attempt to drain the node once the Ready=Unknown condition was met; I'm not sure there is a quick/simple way to set that particular node condition directly.
For the drain to actually succeed, I think the expectation is that the kubelet remains operational so that it can report pod status?
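To add some detail on the eviction point: draining works by POSTing to each pod's pods/eviction subresource. A minimal sketch of such an eviction body (pod name and namespace are placeholders) looks like this:

apiVersion: policy/v1beta1   # policy/v1 on newer clusters
kind: Eviction
metadata:
  name: example-pod        # placeholder: the pod being evicted
  namespace: default       # placeholder namespace

The API server accepts the eviction and marks the pod for deletion, but the pod object is only removed once the kubelet on that node confirms the containers have terminated. With the kubelet stopped, the pods sit in Terminating indefinitely, so the drain never completes unless the pods are force-deleted or the node object is removed.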
Hi :)
Same problem here: we wanted to try draino with Ready=Unknown,2m by stopping the kubelet, and it was impossible for draino to finish the drain.
Does someone have an idea of how to solve this, please?