
Asking the experts: how does draino work?

ghost opened this issue 5 years ago • 4 comments

Dear experts, please let me ask a question about NPD; I'm not sure whether my understanding is right, and I hope you can correct me. Draino is a remediation component that can be used together with NPD. For example, when a node hits a kernel deadlock, or its CPU or disk fails, NPD picks up that information. To prevent the node from being reused, draino marks the node as unschedulable so that no other containers are scheduled onto it, and evicts the pods running on it. Is that understanding correct? To realize this, draino's job is to judge whether the information NPD reports indicates something like a kernel deadlock or a CPU or disk problem, and if a rule matches, to cordon and drain that node. After the eviction, can the remaining nodes be guaranteed to have room to schedule the pods from the drained node? Perhaps the cluster autoscaler is used for that. Is this understanding correct?
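For example, my understanding is that NPD records such a problem as a condition in the node's status, roughly like the following (illustrative values only; the exact type, reason and message come from the NPD rules):

    # Illustrative Node .status.conditions entry after NPD's kernel-monitor
    # matches a "permanent" rule; the values are made up for this example.
    status:
      conditions:
        - type: KernelDeadlock        # condition type defined in the NPD config
          status: "True"              # "True" means the problem is currently present
          reason: DockerHung          # reason of the NPD rule that matched
          message: "task docker:7 blocked for more than 300 seconds."
          lastTransitionTime: "2020-06-28T10:00:00Z"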

ghost avatar Jun 28 '20 10:06 ghost

Yes your understanding is correct:

  • NPD sets node conditions. By themselves, conditions do not cause a node to be drained and cordoned. Many are informational.
  • Draino allows one to configure conditions that should result in the node being drained - what you described as 👇:

To prevent the node from being reused, draino marks the node as unschedulable so that no other containers are scheduled onto it, and evicts the pods running on it.

  • Regular kubernetes mechanisms will attempt to schedule your pod to another node (e.g. because only 2/3 pod replicas are now running).
  • As a result of the node being drained, the autoscaler will typically identify an underutilized node and delete it. This threshold is configurable in the autoscaler. Draino does not itself destroy nodes.
  • As a result of there being fewer nodes, the autoscaler will typically create a new one for you if there is not enough CPU/Memory in the remaining nodes to schedule the pods that were evicted when the node was drained.
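To make this concrete, here is a minimal sketch of how the pieces are typically wired together (the image reference and flag values are illustrative, and the autoscaler flag named in the comments may differ between versions): draino takes the node condition types it should react to as trailing arguments, while creating and deleting nodes is left to the cluster autoscaler.

    # Sketch of a draino container spec; the trailing arguments are the
    # node condition types that should trigger a cordon and drain.
    containers:
      - name: draino
        image: planetlabs/draino:latest   # illustrative image reference
        command: [/draino, --debug, KernelDeadlock]
    # Removing the now-empty node (and adding capacity back) is handled by the
    # cluster autoscaler, e.g. tuned via --scale-down-utilization-threshold.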

jacobstr avatar Jun 29 '20 19:06 jacobstr

> Yes your understanding is correct:
>
> * NPD sets node conditions. By themselves, conditions do not cause a node to be `drained` and `cordoned`. Many are informational.
>
> * Draino allows one to configure conditions that should result in the node being `drained` - what you described as 👇:
>
> To prevent the node from being reused, draino marks the node as unschedulable so that no other containers are scheduled onto it, and evicts the pods running on it.
>
> * Regular kubernetes mechanisms will attempt to schedule your `pod` to another `node` (e.g. because only 2/3 pod replicas are now running).
>
> * As a result of the node being drained, the autoscaler will typically identify an underutilized node and delete it. This threshold is configurable in the [autoscaler](https://github.com/kubernetes/autoscaler/blob/1434d14ec768cf099a1f3d8f615854bf1361e484/cluster-autoscaler/main.go#L101). Draino does not itself destroy nodes.
>
> * As a result of there being fewer nodes, the autoscaler will typically create a new one for you if there is not enough CPU/Memory in the remaining nodes to schedule the pods that were `evicted` when the node was `drained`.

Sorry, I'd like to understand a bit more. As you mentioned, NPD itself cannot achieve self-healing; it can only report the relevant information, so the self-healing part presumably has to be implemented by draino. I have deployed a draino pod running on one of the nodes in the cluster. How does draino obtain the events and conditions that NPD produces? There are many event types, and a given type may not mean the node is unusable. How does draino judge whether the information indicates something like a kernel deadlock, a failed CPU, or a broken disk? And once draino judges the information serious enough, is the node marked as unschedulable and then drained automatically?

ghost avatar Jul 01 '20 03:07 ghost

hey

Can someone help me test or make NPD and draino work together for my custom condition? I have the following in the ConfigMap for NPD:

docker-monitor.json: |
        {
            "plugin": "journald", 
            "pluginConfig": {
                    "source": "docker"
            },
            "logPath": "/var/log/journal", 
            "lookback": "5m",
            "bufferSize": 10,
            "source": "docker-monitor",
            "conditions": [],
            "rules": [              
                    {
                            "type": "temporary", 
                            "reason": "CorruptDockerImage", 
                            "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" 
                    }
            ]
        }
kernel-monitor.json: |
      {
          "plugin": "journald", 
          "pluginConfig": {
                  "source": "kernel"
          },
          "logPath": "/var/log/journal", 
          "lookback": "5m",
          "bufferSize": 10,
          "source": "kernel-monitor",
          "conditions": [                 
                  {
                          "type": "KernelDeadlock", 
                          "reason": "KernelHasNoDeadlock", 
                          "message": "kernel has no deadlock"  
                  },
                  {
                          "type": "Ready",
                          "reason": "NodeStatusUnknown",
                          "message": "Kubelet stopped posting node status"
                  }
          ],
          "rules": [
                  {
                          "type": "temporary",
                          "reason": "OOMKilling",
                          "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB"
                  },
                  {
                          "type": "temporary",
                          "reason": "TaskHung",
                          "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                  },
                  {
                          "type": "temporary",
                          "reason": "UnregisterNetDevice",
                          "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                  },
                  {
                          "type": "temporary",
                          "reason": "KernelOops",
                          "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                  },
                  {
                          "type": "temporary/permanent",
                          "condition": "NodeStatusUnknown",
                          "reason": "NodeStatusUnknown",
                          "pattern": "Kubelet stopped posting node status"
                  },
                  {
                          "type": "temporary",
                          "reason": "KernelOops",
                          "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                  },
                  {
                          "type": "permanent",
                          "condition": "KernelDeadlock",
                          "reason": "AUFSUmountHung",
                          "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
                  },
                  {
                          "type": "permanent",
                          "condition": "KernelDeadlock",
                          "reason": "DockerHung",
                          "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                  }
          ]
      }

I'm majorly concerned about the following, as this happens quite frequently with us.

                 {
                          "type": "temporary/permanent",
                          "condition": "NodeStatusUnknown",
                          "reason": "NodeStatusUnknown",
                          "pattern": "Kubelet stopped posting node status"
                  }

I have draino configured like this:

- command: [/draino, --debug, --evict-daemonset-pods, --evict-emptydir-pods, --evict-unreplicated-pods, KernelDeadlock, NodeStatusUnknown]

Is that how it works?

tarunptala avatar Feb 10 '21 09:02 tarunptala