
Operator is vulnerable to misoperations and drives the cluster to broken state

Open hoyhbx opened this issue 1 year ago • 4 comments

What version of redis operator are you using?

kubectl logs <_redis-operator_pod_name> -n <namespace>

redis-operator version: quay.io/opstree/redis-operator:v0.10.0

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (kubectl version)?

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:26:19Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-05-19T19:53:08Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What did you do?

Hello redis operator developers,

We found that many properties in the CR can easily drive the cluster into a broken state if not handled carefully. For example, specifying a bad value for spec.redisFollower.affinity causes the StatefulSet to perform a rolling restart, but the restarted pods cannot be scheduled. There are many other examples, including spec.redisFollower.livenessProbe, which can cause Redis to be restarted continuously, and spec.redisFollower.redisConfig, which can cause Redis to crash due to misconfiguration.

A concrete example is to submit a CR with the following affinity:

spec:
  redisFollower:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - 'NULL'
...

The operator updates the StatefulSet, which triggers a rolling update, but the newly created Redis pod cannot be scheduled because no node satisfies the affinity rule.

This causes severe consequences in production. We believe these are misoperation vulnerabilities, where the operator fails to reject the misoperation from users. The operator uses controller-gen to automatically generate validations for many properties, but these static validations fall short for more complicated constraints. For example, rejecting an invalid nodeSelector requires knowing which nodes are available in the Kubernetes cluster; validating whether an affinity rule is satisfiable requires scheduler knowledge; and validating redisConfig requires Redis-specific knowledge.
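To illustrate why the affinity case needs runtime information: checking whether a `kubernetes.io/hostname In [...]` requirement is satisfiable means consulting the live node list, which a static CRD schema cannot express. A minimal sketch (the type and function names here are simplified, hypothetical stand-ins for the real Kubernetes API types):

```go
package main

import "fmt"

// NodeSelectorRequirement models a single label requirement under the
// "In" operator (a simplified stand-in for the real API type).
type NodeSelectorRequirement struct {
	Key    string
	Values []string // allowed label values under the "In" operator
}

// Satisfiable reports whether at least one node's labels match the
// requirement. A validating webhook would need exactly this kind of
// cluster-state lookup, which static CRD validation cannot perform.
func Satisfiable(req NodeSelectorRequirement, nodes []map[string]string) bool {
	for _, labels := range nodes {
		for _, v := range req.Values {
			if labels[req.Key] == v {
				return true
			}
		}
	}
	return false
}

func main() {
	// Hypothetical cluster with two nodes.
	nodes := []map[string]string{
		{"kubernetes.io/hostname": "node-1"},
		{"kubernetes.io/hostname": "node-2"},
	}
	// The affinity from the issue report: no node is named "NULL".
	bad := NodeSelectorRequirement{Key: "kubernetes.io/hostname", Values: []string{"NULL"}}
	fmt.Println(Satisfiable(bad, nodes)) // false
}
```

The same check against an existing hostname would return true, which is why only a component with access to the node list can reject the bad CR up front.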

We want to open this issue to discuss what you think the best practice for handling this problem is, and what functionality Kubernetes could provide to make this validation easier. Is there a way to prevent the bad operation from happening in the first place, or a way for the operator to automatically recognize that the StatefulSet is stuck and perform an automatic recovery? If you know of any practical code fixes for this issue, we are also happy to send a PR for that.
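The automatic-recovery path mentioned above could start from a stuck-rollout check. A minimal sketch, using a simplified status struct rather than the real `StatefulSetStatus` (a real operator would read the status from the API server and also require the condition to persist past a timeout before rolling back):

```go
package main

import "fmt"

// RolloutStatus holds the few StatefulSet status fields needed to spot a
// wedged rolling update (field names mirror the Kubernetes API, but this
// struct is a simplified, hypothetical stand-in).
type RolloutStatus struct {
	Replicas        int32  // desired replicas
	ReadyReplicas   int32  // replicas currently passing readiness
	CurrentRevision string // revision the StatefulSet is moving away from
	UpdateRevision  string // revision being rolled out
}

// Stuck reports whether a rolling update appears wedged: a new revision
// is in flight but the ready count has fallen behind the desired count.
// In the affinity example from this issue, the new pod stays Pending
// forever, so this condition never clears on its own.
func Stuck(s RolloutStatus) bool {
	rollingUpdate := s.CurrentRevision != s.UpdateRevision
	return rollingUpdate && s.ReadyReplicas < s.Replicas
}

func main() {
	s := RolloutStatus{Replicas: 3, ReadyReplicas: 2,
		CurrentRevision: "rev-1", UpdateRevision: "rev-2"}
	fmt.Println(Stuck(s)) // true
}
```

An operator that detects this state could revert the StatefulSet to the last known-good revision instead of waiting indefinitely.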

We are also happy to provide the full list of properties that are vulnerable to misoperations if you are interested.

hoyhbx avatar Apr 12 '23 19:04 hoyhbx

@hoyhbx Thanks for reporting this issue. I think you are right; this needs to be fixed soon. If you want to report more properties vulnerable to misoperations, you are most welcome.

shubham-cmyk avatar Apr 13 '23 14:04 shubham-cmyk

Thanks @shubham-cmyk for the confirmation! Here is the list of properties in the CR that we found to be vulnerable to misoperations:

  • cr.spec.kubernetesConfig.imagePullPolicy
  • cr.spec.kubernetesConfig.redisSecret
  • cr.spec.kubernetesConfig.resources.limits.* when specified with a resource nonexistent in the cluster
  • cr.spec.nodeSelector.* when the node is not available for scheduling in the cluster
  • cr.spec.priorityClassName when the class does not exist
  • cr.spec.redisExporter.env.* when the referenced object does not exist in the cluster
  • cr.spec.redisFollower.affinity
  • cr.spec.redisFollower.livenessProbe.* when the probe is not supported by the application
  • cr.spec.redisFollower.redisConfig.* when the config is invalid
  • cr.spec.sidecars.env.* when the referenced object does not exist in the cluster
  • cr.spec.TLS
  • cr.spec.tolerations

Some of these could be statically checked, e.g. imagePullPolicy is an enum property and could be validated in the CRD. Some need application-specific knowledge, e.g. cr.spec.redisFollower.redisConfig.* to validate whether the Redis config is correct. Others need runtime information to check whether the value is valid, e.g. cr.spec.priorityClassName needs a check that the priorityClass exists in the cluster.
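The statically-checkable tier is the easy one: imagePullPolicy is a closed enum (Always, Never, IfNotPresent), so the check needs no cluster state and is exactly what controller-gen can generate from a `+kubebuilder:validation:Enum` marker. A minimal sketch of the equivalent hand-written check (the function name is ours, not the operator's):

```go
package main

import "fmt"

// validImagePullPolicies lists the values Kubernetes accepts for
// imagePullPolicy; it is a closed enum, so a CRD schema can enforce it.
var validImagePullPolicies = map[string]bool{
	"Always":       true,
	"Never":        true,
	"IfNotPresent": true,
}

// ValidateImagePullPolicy is the kind of check a
// `+kubebuilder:validation:Enum=Always;Never;IfNotPresent` marker would
// generate: purely static, no cluster lookup required.
func ValidateImagePullPolicy(p string) error {
	if p == "" || validImagePullPolicies[p] {
		return nil // empty means "use the Kubernetes default"
	}
	return fmt.Errorf("invalid imagePullPolicy %q", p)
}

func main() {
	fmt.Println(ValidateImagePullPolicy("IfNotPresent") == nil) // true
	fmt.Println(ValidateImagePullPolicy("Sometimes") == nil)    // false
}
```

The other two tiers (application-specific and runtime checks) cannot be generated this way, which is why a validating webhook with access to the cluster and to Redis semantics is the natural place for them.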

hoyhbx avatar Apr 14 '23 19:04 hoyhbx

@hoyhbx I will ship a validation webhook with the next release.

shubham-cmyk avatar Apr 15 '23 17:04 shubham-cmyk

/assign

drivebyer avatar Sep 08 '23 01:09 drivebyer