cluster-operator
storageos pod is not evicted when kube node is NotReady
Hello,
I see that the storageos-daemonset pod is not being "removed" from the NotReady node.
dm103 Ready master 37d v1.13.0 192.168.3.249 <none> Ubuntu 16.04.5 LTS 4.4.0-87-generic docker://17.3.2
dm104 NotReady <none> 37d v1.13.0 192.168.3.251 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm201 Ready <none> 37d v1.13.0 192.168.3.231 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm202 Ready <none> 37d v1.13.0 192.168.3.229 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm203 Ready <none> 37d v1.13.0 192.168.3.225 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm204 Ready <none> 37d v1.13.0 192.168.3.226 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
storageos storageos-daemonset-h2kzs 3/3 Running 0 87m 192.168.3.231 dm201 <none> <none>
storageos storageos-daemonset-hhlz4 3/3 Running 0 18m 192.168.3.251 dm104 <none> <none>
storageos storageos-daemonset-jrlcn 3/3 Running 0 87m 192.168.3.225 dm203 <none> <none>
storageos storageos-daemonset-n2tqs 3/3 Running 0 87m 192.168.3.226 dm204 <none> <none>
storageos storageos-daemonset-s6v88 3/3 Running 0 87m 192.168.3.229 dm202 <none> <none>
I waited for several minutes, but the pod was not evicted.
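For reference, the listings above are the kind of output you get from commands along these lines (the exact invocations are my assumption):

kubectl get nodes -o wide
kubectl get pods --all-namespaces -o wide | grep storageos-daemonset

The pod on the NotReady node (dm104) stays listed as Running.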
Hi, can you provide more details about your deployment? The cluster config file or a description of the storageoscluster resource would be very helpful.
If the operator version you're running is from the master branch and not a stable release, this is the expected behavior. In #110 we added a pod priority class to the storageos pods to make them critical resources, which prevents eviction. We need this because storage is a critical part of a cluster.
This behavior is seen only if storageos is deployed in the kube-system namespace. For any other namespace, the pod priority won't be set.
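To illustrate, the effect on the DaemonSet pod template is along these lines (only a sketch; the exact priority class name the operator sets may differ, system-node-critical is assumed here):

spec:
  template:
    spec:
      # assumed priority class; marks the pod as critical so it is not evicted
      priorityClassName: system-node-critical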
Sure, it is from the master branch.
more storageos_v1_storageoscluster_crd.yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: storageosclusters.storageos.com
spec:
  group: storageos.com
  names:
    kind: StorageOSCluster
    listKind: StorageOSClusterList
    plural: storageosclusters
    singular: storageoscluster
    shortNames:
    - stos
  scope: Namespaced
  version: v1
  additionalPrinterColumns:
  - name: Ready
    type: string
    description: Ready status of the storageos nodes.
    JSONPath: .status.ready
  - name: Status
    type: string
    description: Status of the whole cluster.
    JSONPath: .status.phase
  - name: Age
    type: date
    JSONPath: .metadata.creationTimestamp
  validation:
    openAPIV3Schema:
      properties:
        apiVersion:
          type: string
        kind:
          type: string
        metadata: {}
        spec:
          properties:
            join:
              type: string
            namespace:
              type: string
            k8sDistro:
              type: string
            disableFencing:
              type: boolean
            disableTelemetry:
              type: boolean
            images:
              properties:
                nodeContainer:
                  type: string
                initContainer:
                  type: string
                csiDriverRegistrarContainer:
                  type: string
                csiExternalProvisionerContainer:
                  type: string
                csiExternalAttacherContainer:
                  type: string
            csi:
              properties:
                enable:
                  type: boolean
                enableProvisionCreds:
                  type: boolean
                enableControllerPublishCreds:
                  type: boolean
                enableNodePublishCreds:
                  type: boolean
            service:
              properties:
                name:
                  type: string
                type:
                  type: string
                externalPort:
                  type: integer
                  format: int32
                internalPort:
                  type: integer
                  format: int32
            secretRefName:
              type: string
            secretRefNamespace:
              type: string
            tlsEtcdSecretRefName:
              type: string
            tlsEtcdSecretRefNamespace:
              type: string
            sharedDir:
              type: string
            ingress:
              properties:
                enable:
                  type: boolean
                hostname:
                  type: string
                tls:
                  type: boolean
                annotations: {}
            kvBackend:
              properties:
                address:
                  type: string
                backend:
                  type: string
            pause:
              type: boolean
            debug:
              type: boolean
            nodeSelectorTerms: {}
            tolerations: {}
            resources:
              properties:
                limits: {}
                requests: {}
        status:
          properties:
            phase:
              type: string
            nodeHealthStatus: {}
            nodes:
              type: array
              items:
                type: string
            ready:
              type: string
            members:
              properties:
                ready: {}
                unready: {}
more storageos_v1_storageoscluster_cr.yaml
apiVersion: storageos.com/v1
kind: StorageOSCluster
metadata:
  name: example-storageoscluster
  namespace: "default"
spec:
  secretRefName: "storageos-api"
  secretRefNamespace: "default"
  namespace: "storageos"
  # k8sDistro: openshift
  # tlsEtcdSecretRefName:
  # tlsEtcdSecretRefNamespace:
  # disableTelemetry: true
  # images:
  #   nodeContainer:
  #   initContainer:
  #   csiNodeDriverRegistrarContainer:
  #   csiClusterDriverRegistrarContainer:
  #   csiExternalProvisionerContainer:
  #   csiExternalAttacherContainer:
  csi:
    enable: true
    # endpoint: /var/lib/kubelet/device-plugins/
    # registrarSocketDir: /var/lib/kubelet/device-plugins/
    # kubeletDir: /var/lib/kubelet
    # pluginDir: /var/lib/kubelet/plugins/storageos/
    # deviceDir: /dev
    # registrationDir: /var/lib/kubelet/plugins
    # enableProvisionCreds: false
    # enableControllerPublishCreds: false
    # enableNodePublishCreds: false
    # kubeletRegistrationPath: /var/lib/kubelet/plugins/storageos/csi.sock
    # driverRegisterationMode: node-register
    # DriverRequiresAttachment: "true"
  # service:
  #   name: "storageos"
  #   type: "ClusterIP"
  #   externalPort: 5705
  #   internalPort: 5705
  #   annotations:
  #     service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
  #     service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443,8443"
  # ingress:
  #   enable: false
  #   hostname: storageos.local
  #   tls: false
  #   annotations:
  #     kubernetes.io/ingress.class: nginx
  #     kubernetes.io/tls-acme: true
  # sharedDir should be set if running kubelet in a container. This should
  # be the path shared into the kubelet container, typically:
  # "/var/lib/kubelet/plugins/kubernetes.io~storageos". If not set, defaults
  # will be used.
  # sharedDir:
  # kvBackend:
  #   address:
  #   backend:
  # nodeSelectorTerms:
  #   - matchExpressions:
  #     - key: somekey
  #       operator: In
  #       values:
  #       - nodefoo
  # tolerations:
  #   - key: somekey
  #     operator: "Equal"
  #     value: nodefoo
  #     effect: "NoSchedule"
  # resources:
  #   limits:
  #     memory: "1Gi"
  #   requests:
  #     memory: "702Mi"
  # disableFencing: false
---
apiVersion: v1
kind: Secret
metadata:
  name: "storageos-api"
  namespace: "default"
  labels:
    app: "storageos"
type: "kubernetes.io/storageos"
data:
  # echo -n '<secret>' | base64
  apiUsername: c3RvcmFnZW9z
  apiPassword: c3RvcmFnZW9z
  # Add base64 encoded TLS cert and key below if ingress.tls is set to true.
  # tls.crt:
  # tls.key:
  # Add base64 encoded creds below for CSI credentials.
  # csiProvisionUsername:
  # csiProvisionPassword:
  # csiControllerPublishUsername:
  # csiControllerPublishPassword:
  # csiNodePublishUsername:
  # csiNodePublishPassword:
My namespace is storageos. I have made no changes so far to the files I got from master.
When the node is removed from the cluster, its pod is still shown in kubectl get pods, but kubectl describe shows the pod as not ready. I think at that point it needs to be evicted. Please correct me if there are other thoughts or considerations.
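To be concrete, these are roughly the checks I mean (pod and node names taken from the listings above):

kubectl describe pod storageos-daemonset-hhlz4 -n storageos
kubectl describe node dm104 | grep -A2 Taints

The NotReady node carries the node.kubernetes.io/not-ready and/or node.kubernetes.io/unreachable taints, which is what normally triggers eviction of pods that don't tolerate them.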
In that case, I don't think there's anything wrong or anything that can be done. It's the k8s scheduler that makes decisions about eviction based on various factors. Our concern is to avoid eviction as much as possible. We don't have control over the scheduler's decision, and I don't see anything bad with it. Usually the k8s scheduler takes action when a node has been unavailable for a minute or so, so waiting for some time may get the pod evicted automatically. But again, it doesn't matter whether it keeps running or gets evicted: StorageOS is running as a DaemonSet on all the other available nodes.
If an instance of storageos is unhealthy, storageos is usually aware of that and takes the necessary actions to fail over and move volumes to other healthy nodes.
You can check the state of storageos nodes using the storageos cli command storageos node ls.
Also, kubectl describe stos example-storageoscluster will show you the description of your current cluster with all the active nodes and their health.
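For example:

storageos node ls
kubectl describe stos example-storageoscluster

(example-storageoscluster is the name from the CR above; adjust to your cluster.)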
If I put the node back into the cluster, the container does not start (restart) properly; that is what I saw yesterday.
If I evict the pod manually (force pod delete) and then add the node back, the DaemonSet restarts the container and it starts successfully.
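The force delete I used was along these lines (pod name from the earlier listing; --grace-period=0 together with --force removes the pod without waiting for graceful shutdown):

kubectl delete pod storageos-daemonset-hhlz4 -n storageos --force --grace-period=0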
If I add/edit tolerations for the pod, e.g.
kubectl edit pod storageos-daemonset-jrlcn -n storageos
with, for example, tolerationSeconds: 3
right before I remove the node, the pod gets evicted successfully, and then the pod also restarts successfully when the node is added back.
That is what I saw yesterday; if you want me to repeat it, please let me know.
If that's what's happening, we can add tolerations for all the resources, something like this in the pod spec:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 60
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 60
That should reduce the recovery delay for all the pods.
Yes. What file needs to be modified? Or is it in the operator's source code?
To test it first, you can directly edit the daemonset resource and add the tolerations to it.
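For example, assuming the DaemonSet keeps the name seen in the pod listings above:

kubectl edit daemonset storageos-daemonset -n storageos

and add the tolerations block above under spec.template.spec. Note that the operator may revert manual changes to the resources it manages, so this is only for a quick test.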
If it results in the right behavior, you can add those tolerations in the cluster spec like this. It supports all the toleration attributes, not just the ones in the example.
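A sketch of how that could look in the cluster resource, reusing the example CR from earlier (the tolerations block mirrors the one suggested above; adjust the values to taste):

apiVersion: storageos.com/v1
kind: StorageOSCluster
metadata:
  name: example-storageoscluster
  namespace: "default"
spec:
  secretRefName: "storageos-api"
  secretRefNamespace: "default"
  namespace: "storageos"
  csi:
    enable: true
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60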
We don't have any default tolerations at the moment. A toleration like this, which improves the recovery time, could be a good default, but sometimes we may want it to be overridable. To add defaults, we could add these tolerations to the addToleration() function; that function is shared across all the other resources, so that would work well. But for now, I think we should try adding the toleration in the cluster resource manifest file. We need to think more about better ways to have overridable default tolerations, and to test more.
Thanks for highlighting this issue.
I agree.
If you test and see different behavior, please let me know.
Since a node going down is an event we need to heal from, I think evicting the pod makes sense, and to my taste, as soon as possible, then restarting it when the node is back. Please share whether this works well from the StorageOS point of view.
I tried this and I'm afraid we can't do much when it comes to DaemonSets. As per https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions:
DaemonSet pods are created with NoExecute tolerations for the following taints, with no tolerationSeconds:
node.kubernetes.io/unreachable
node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these problems, which matches the behavior when this feature is disabled.
We can't set the toleration time. I guess we have to wait for the k8s scheduler to automatically detect and restart the DaemonSet pod.
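For reference, the tolerations the DaemonSet controller adds automatically look like this, with no tolerationSeconds, so the pod stays bound to the lost node indefinitely:

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute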
More info here: https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html:
Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.