cluster-operator
storageos pod is not evicted when kube node is NotReady
Hello,
I see that the storageos-daemonset pod is not being "removed" from the NotReady node.
dm103 Ready master 37d v1.13.0 192.168.3.249 <none> Ubuntu 16.04.5 LTS 4.4.0-87-generic docker://17.3.2
dm104 NotReady <none> 37d v1.13.0 192.168.3.251 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm201 Ready <none> 37d v1.13.0 192.168.3.231 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm202 Ready <none> 37d v1.13.0 192.168.3.229 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm203 Ready <none> 37d v1.13.0 192.168.3.225 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
dm204 Ready <none> 37d v1.13.0 192.168.3.226 <none> Ubuntu 16.04.3 LTS 4.4.0-87-generic docker://17.3.2
storageos storageos-daemonset-h2kzs 3/3 Running 0 87m 192.168.3.231 dm201 <none> <none>
storageos storageos-daemonset-hhlz4 3/3 Running 0 18m 192.168.3.251 dm104 <none> <none>
storageos storageos-daemonset-jrlcn 3/3 Running 0 87m 192.168.3.225 dm203 <none> <none>
storageos storageos-daemonset-n2tqs 3/3 Running 0 87m 192.168.3.226 dm204 <none> <none>
storageos storageos-daemonset-s6v88 3/3 Running 0 87m 192.168.3.229 dm202 <none> <none>
I waited for several minutes, but the pod was not evicted.
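For reference, the listings above are the kind of output you get from commands along these lines (the exact invocations are my assumption):

kubectl get nodes -o wide
kubectl get pods --all-namespaces -o wide | grep storageos-daemonset

The pod on the NotReady node (dm104) stays listed as Running.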
Hi, can you provide more details about your deployment? The cluster config file or a description of the storageoscluster resource would be very helpful.
If the operator version you're running is from the master branch and not a stable release, this is the expected behavior. In #110 we added a pod priority class to the storageos pods to make them critical resources, which prevents eviction. We need this because storage is a critical part of a cluster.
This behavior is seen only if storageos is deployed in the kube-system namespace. For any other namespace, the pod priority won't be set.
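To illustrate, the effect on the DaemonSet pod template is along these lines (only a sketch; the exact priority class name the operator sets may differ, system-node-critical is assumed here):

spec:
  template:
    spec:
      # assumed priority class; marks the pod as critical so it is not evicted
      priorityClassName: system-node-critical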
Sure, it is from the master branch.
more storageos_v1_storageoscluster_crd.yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: storageosclusters.storageos.com
spec:
  group: storageos.com
  names:
    kind: StorageOSCluster
    listKind: StorageOSClusterList
    plural: storageosclusters
    singular: storageoscluster
    shortNames:
    - stos
  scope: Namespaced
  version: v1
  additionalPrinterColumns:
  - name: Ready
    type: string
    description: Ready status of the storageos nodes.
    JSONPath: .status.ready
  - name: Status
    type: string
    description: Status of the whole cluster.
    JSONPath: .status.phase
  - name: Age
    type: date
    JSONPath: .metadata.creationTimestamp
  validation:
    openAPIV3Schema:
      properties:
        apiVersion:
          type: string
        kind:
          type: string
        metadata: {}
        spec:
          properties:
            join:
              type: string
            namespace:
              type: string
            k8sDistro:
              type: string
            disableFencing:
              type: boolean
            disableTelemetry:
              type: boolean
            images:
              properties:
                nodeContainer:
                  type: string
                initContainer:
                  type: string
                csiDriverRegistrarContainer:
                  type: string
                csiExternalProvisionerContainer:
                  type: string
                csiExternalAttacherContainer:
                  type: string
            csi:
              properties:
                enable:
                  type: boolean
                enableProvisionCreds:
                  type: boolean
                enableControllerPublishCreds:
                  type: boolean
                enableNodePublishCreds:
                  type: boolean
            service:
              properties:
                name:
                  type: string
                type:
                  type: string
                externalPort:
                  type: integer
                  format: int32
                internalPort:
                  type: integer
                  format: int32
            secretRefName:
              type: string
            secretRefNamespace:
              type: string
            tlsEtcdSecretRefName:
              type: string
            tlsEtcdSecretRefNamespace:
              type: string
            sharedDir:
              type: string
            ingress:
              properties:
                enable:
                  type: boolean
                hostname:
                  type: string
                tls:
                  type: boolean
                annotations: {}
            kvBackend:
              properties:
                address:
                  type: string
                backend:
                  type: string
            pause:
              type: boolean
            debug:
              type: boolean
            nodeSelectorTerms: {}
            tolerations: {}
            resources:
              properties:
                limits: {}
                requests: {}
        status:
          properties:
            phase:
              type: string
            nodeHealthStatus: {}
            nodes:
              type: array
              items:
                type: string
            ready:
              type: string
            members:
              properties:
                ready: {}
                unready: {}
more storageos_v1_storageoscluster_cr.yaml
apiVersion: storageos.com/v1
kind: StorageOSCluster
metadata:
  name: example-storageoscluster
  namespace: "default"
spec:
  secretRefName: "storageos-api"
  secretRefNamespace: "default"
  namespace: "storageos"
  # k8sDistro: openshift
  # tlsEtcdSecretRefName:
  # tlsEtcdSecretRefNamespace:
  # disableTelemetry: true
  # images:
  #   nodeContainer:
  #   initContainer:
  #   csiNodeDriverRegistrarContainer:
  #   csiClusterDriverRegistrarContainer:
  #   csiExternalProvisionerContainer:
  #   csiExternalAttacherContainer:
  csi:
    enable: true
    # endpoint: /var/lib/kubelet/device-plugins/
    # registrarSocketDir: /var/lib/kubelet/device-plugins/
    # kubeletDir: /var/lib/kubelet
    # pluginDir: /var/lib/kubelet/plugins/storageos/
    # deviceDir: /dev
    # registrationDir: /var/lib/kubelet/plugins
    # enableProvisionCreds: false
    # enableControllerPublishCreds: false
    # enableNodePublishCreds: false
    # kubeletRegistrationPath: /var/lib/kubelet/plugins/storageos/csi.sock
    # driverRegisterationMode: node-register
    # DriverRequiresAttachment: "true"
  # service:
  #   name: "storageos"
  #   type: "ClusterIP"
  #   externalPort: 5705
  #   internalPort: 5705
  #   annotations:
  #     service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
  #     service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443,8443"
  # ingress:
  #   enable: false
  #   hostname: storageos.local
  #   tls: false
  #   annotations:
  #     kubernetes.io/ingress.class: nginx
  #     kubernetes.io/tls-acme: true
  # sharedDir should be set if running kubelet in a container. This should
  # be the path shared into the kubelet container, typically:
  # "/var/lib/kubelet/plugins/kubernetes.io~storageos". If not set, defaults
  # will be used.
  # sharedDir:
  # kvBackend:
  #   address:
  #   backend:
  # nodeSelectorTerms:
  #   - matchExpressions:
  #     - key: somekey
  #       operator: In
  #       values:
  #       - nodefoo
  # tolerations:
  #   - key: somekey
  #     operator: "Equal"
  #     value: nodefoo
  #     effect: "NoSchedule"
  # resources:
  #   limits:
  #     memory: "1Gi"
  #   requests:
  #     memory: "702Mi"
  # disableFencing: false
---
apiVersion: v1
kind: Secret
metadata:
  name: "storageos-api"
  namespace: "default"
  labels:
    app: "storageos"
type: "kubernetes.io/storageos"
data:
  # echo -n '<secret>' | base64
  apiUsername: c3RvcmFnZW9z
  apiPassword: c3RvcmFnZW9z
  # Add base64 encoded TLS cert and key below if ingress.tls is set to true.
  # tls.crt:
  # tls.key:
  # Add base64 encoded creds below for CSI credentials.
  # csiProvisionUsername:
  # csiProvisionPassword:
  # csiControllerPublishUsername:
  # csiControllerPublishPassword:
  # csiNodePublishUsername:
  # csiNodePublishPassword:
My namespace is storageos. I have made no changes so far to the files I got from master.
When the node is removed from the cluster, its pod is still shown in kubectl get pods, but kubectl describe shows the pod as not ready. I think at that point it needs to be evicted. Please correct me if there are other thoughts or considerations.
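To be concrete, these are roughly the checks I mean (pod and node names taken from the listings above):

kubectl describe pod storageos-daemonset-hhlz4 -n storageos
kubectl describe node dm104 | grep -A2 Taints

The NotReady node carries the node.kubernetes.io/not-ready and/or node.kubernetes.io/unreachable taints, which is what normally triggers eviction of pods that don't tolerate them.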
In that case, I don't think there's anything wrong or anything that can be done. It's the k8s scheduler that makes decisions about eviction based on various factors. Our concern is to avoid eviction as much as possible. We don't have control over the scheduler's decision, and I don't see anything bad with it. Usually the k8s scheduler takes action when a node has been unavailable for a minute or so, so waiting for some time may get the pod evicted automatically. But again, it doesn't matter whether it keeps running or gets evicted: StorageOS is running as a DaemonSet on all the other available nodes.
If an instance of storageos is unhealthy, storageos is usually aware of that and takes the necessary actions to fail over and move volumes to other healthy nodes.
You can check the state of storageos nodes using the storageos cli command storageos node ls.
Also, kubectl describe stos example-storageoscluster will show you the description of your current cluster with all the active nodes and their health.
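For example:

storageos node ls
kubectl describe stos example-storageoscluster

(example-storageoscluster is the name from the CR above; adjust to your cluster.)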
If I put the node back into the cluster, the container does not start (restart) properly; that is what I saw yesterday.
If I evict the pod manually (force pod delete) and then add the node back, the DaemonSet restarts the container and it starts successfully.
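The force delete I used was along these lines (pod name from the earlier listing; --grace-period=0 together with --force removes the pod without waiting for graceful shutdown):

kubectl delete pod storageos-daemonset-hhlz4 -n storageos --force --grace-period=0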
If I add/edit tolerations for the pod, e.g.
kubectl edit pod storageos-daemonset-jrlcn -n storageos
with, for example, tolerationSeconds: 3
right before I remove the node, the pod gets evicted successfully, and then the pod also restarts successfully when the node is added back.
That is what I saw yesterday; if you want me to repeat it, please let me know.
If that's what's happening, we can add tolerations for all the resources, something like this in the pod spec:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 60
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 60
That should reduce the recovery delay for all the pods.
Yes. What file needs to be modified? Or is it in the operator's source code?
To test it first, you can directly edit the daemonset resource and add the tolerations to it.
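For example, assuming the DaemonSet keeps the name seen in the pod listings above:

kubectl edit daemonset storageos-daemonset -n storageos

and add the tolerations block above under spec.template.spec. Note that the operator may revert manual changes to the resources it manages, so this is only for a quick test.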
If it results in the right behavior, you can add those tolerations in the cluster spec like this. It supports all the toleration attributes, not just the ones in the example.
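A sketch of how that could look in the cluster resource, reusing the example CR from earlier (the tolerations block mirrors the one suggested above; adjust the values to taste):

apiVersion: storageos.com/v1
kind: StorageOSCluster
metadata:
  name: example-storageoscluster
  namespace: "default"
spec:
  secretRefName: "storageos-api"
  secretRefNamespace: "default"
  namespace: "storageos"
  csi:
    enable: true
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60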
We don't have any default tolerations at the moment. A toleration like this, which improves the recovery time, could be a good default, but sometimes we may want it to be overridable. To add defaults, we could add these tolerations to the addToleration() function; that function is shared across all the other resources, so that would work well. But for now, I think we should try adding the toleration in the cluster resource manifest file. We need to think more about better ways to have overridable default tolerations, and to test more.
Thanks for highlighting this issue.
I agree.
If you test and see different behavior, please let me know.
Since a node going down is an event we need to heal from, I think evicting the pod makes sense, and to my taste, as soon as possible, then restarting it when the node is back. Please share whether this works well from the StorageOS point of view.
I tried this and I'm afraid we can't do much when it comes to DaemonSets. As per https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions:
DaemonSet pods are created with NoExecute tolerations for the following taints, with no tolerationSeconds:
node.kubernetes.io/unreachable
node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these problems, which matches the behavior when this feature is disabled.
We can't set the toleration time. I guess we have to wait for the k8s scheduler to automatically detect and restart the DaemonSet pod.
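For reference, the tolerations the DaemonSet controller adds automatically look like this, with no tolerationSeconds, so the pod stays bound to the lost node indefinitely:

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute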
More info here: https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html:
Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.