
Bombardment of 'scan-cisbenchmark-' pods

Open Timoses opened this issue 4 years ago • 9 comments

What steps did you take and what happened:

  1. Installation steps from here: https://aquasecurity.github.io/starboard/v0.10.3/operator/installation/kubectl/
  2. Edited Starboard deployment to use image from our private image registry
  3. Lots of 'scan-cisbenchmark-' pods were generated (over 2600)
    • Why? The Starboard ConfigMap is configured by default to use container images from docker.io, which is not reachable from our clusters, so the pods terminated (a sketch of the kind of imageRef override needed is shown after this list).
    • The jobs are now slowly being removed by the Kubernetes controller.
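
The kind of imageRef override needed looks roughly like this (a sketch only; it assumes the v0.10.3 operator install, where the ConfigMap is named starboard in the starboard-operator namespace and the image key is kube-bench.imageRef):

$ kubectl patch configmap starboard -n starboard-operator --type merge \
    -p '{"data":{"kube-bench.imageRef":"<private-registry>/aquasec/kube-bench:0.5.0"}}'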

What did you expect to happen: The benchmark job should be configured so that it does not bombard the namespace with pods.

Anything else you would like to add:


Environment:

  • Starboard version (use starboard version):
Starboard Version: {Version:0.10.3 Commit:5bd33431a239b98be4a3287563b8664a9b3d5707 Date:2021-05-14T12:20:34Z}
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
  • OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc): Fedora 33

Timoses avatar Jul 02 '21 10:07 Timoses

After correcting the imageRef of kube-bench, it fails because it is unable to mount various volumes. We deploy with RKE/Rancher, which uses different directories. It still produces a bombardment of pods ;).

Timoses avatar Jul 02 '21 11:07 Timoses

👋 @Timoses I'm sorry to hear that Starboard is causing such trouble. We do have a configurable limit on the number of scan jobs created by the Starboard Operator; it programmatically compares the number of active scan jobs against this limit (OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT). We'll try to reproduce this and see what is going on.
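
If it turns out to be related, the limit can also be lowered on the operator Deployment; a minimal sketch, assuming the deployment name and namespace from the installation manifests:

$ kubectl set env deployment/starboard-operator -n starboard-operator \
    OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT=1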

danielpacak avatar Jul 02 '21 11:07 danielpacak

Seeing the same thing in our RKE/rancher cluster. Also following https://aquasecurity.github.io/starboard/v0.10.3/operator/installation/kubectl/
k8s-version: 1.19.

sazo avatar Jul 02 '21 22:07 sazo

@Timoses I started looking into this issue and I don't quite understand what is happening in your cluster:

The Starboard ConfigMap is configured by default to use container images from docker.io, which is not reachable from our clusters, so the pods terminated.

AFAIK, a pod cannot even be started if the image cannot be pulled from a registry, and the pod will get stuck in the ImagePullBackOff or ErrImagePull status. How come they were terminated by the K8s controller manager?

Beyond that, each scan job created to run CIS benchmarks on a K8s node has a deterministic name to make sure that we create only one instance. Any attempt to create scan jobs with the same name would end up with an error returned by the Kubernetes API. Could you share at least a partial listing of those pods that are "bombarding" your cluster, please?
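
A listing along these lines should be enough (a sketch; it assumes the scan pods carry the app.kubernetes.io/managed-by=starboard label that Starboard puts on the jobs it creates):

$ kubectl get pods -n starboard-operator -l app.kubernetes.io/managed-by=starboard -o wide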

I tried reproducing this issue with a kind cluster by prefixing the kube-bench container image with x, i.e. xdocker.io/aquasec/kube-bench:0.5.0. The operator does effectively get "stuck" because it is waiting for the job to complete or fail, but I still don't see an excessive number of pods being created.

That said, could you share the operator logs and a snapshot of the output of the following command in your environment when this problem occurs? I want to see how the scan jobs are named and confirm the statuses of the pods and jobs managed by Starboard.

$ kubectl get pod,job,deploy -n starboard-operator -o wide
NAME                                     READY   STATUS             RESTARTS   AGE   IP           NODE                 NOMINATED NODE   READINESS GATES
pod/scan-cisbenchmark-845b67c44-2gv9z    0/1     ImagePullBackOff   0          87s   10.244.0.8   kind-control-plane   <none>           <none>
pod/starboard-operator-746769c64-9fwj5   1/1     Running            0          16m   10.244.0.5   kind-control-plane   <none>           <none>

NAME                                    COMPLETIONS   DURATION   AGE   CONTAINERS   IMAGES                                SELECTOR
job.batch/scan-cisbenchmark-845b67c44   0/1           87s        87s   kube-bench   xdocker.io/aquasec/kube-bench:0.5.0   controller-uid=e6165dcc-0986-469f-8084-76af040a6fea

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                        SELECTOR
deployment.apps/starboard-operator   1/1     1            1           16m   operator     docker.io/aquasec/starboard-operator:0.10.3   app=starboard-operator

Please also elaborate on your RKE/Rancher deployment. Is it open source or commercial? Which version did you use, and what are the precise steps to reproduce, including the number of K8s nodes?

danielpacak avatar Jul 04 '21 13:07 danielpacak

Seeing the same thing in our RKE/rancher cluster. Also following https://aquasecurity.github.io/starboard/v0.10.3/operator/installation/kubectl/ k8s-version: 1.19.

@sazo Could you share the list of nodes and the list of scan jobs that are created in the operator namespace in your cluster, please? If possible, please also share the logs of the operator.
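
Something like the following would cover it (assuming the default starboard-operator namespace):

$ kubectl get nodes -o wide
$ kubectl get jobs -n starboard-operator -o wide
$ kubectl logs -n starboard-operator deployment/starboard-operator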

danielpacak avatar Jul 05 '21 22:07 danielpacak

That said, could you share the operator logs and a snapshot of the output of the following command in your environment when this problem occurs? I want to see how the scan jobs are named and confirm the statuses of the pods and jobs managed by Starboard.

$ kubectl get pod,job,deploy -n starboard-operator -o wide

After deleting my custom config again and applying the Starboard Operator with kubectl apply -f https://raw.githubusercontent.com/aquasecurity/starboard/v0.10.3/deploy/static/06-starboard-operator.deployment.yaml (and editing the starboard-operator deployment to use our private registry):

pod/scan-cisbenchmark-78758d4f6d-slxms    0/1     Terminating   0          9s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-sqlcl    0/1     Terminating   0          10s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-sw6v6    0/1     Terminating   0          9s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-sxn79    0/1     Terminating   0          11s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-tm7px    0/1     Terminating   0          5s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-tnkkn    0/1     Terminating   0          4s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-trtpg    0/1     Terminating   0          13s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-tvqwf    0/1     Terminating   0          7s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-v9627    0/1     Terminating   0          6s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-v9chb    0/1     Terminating   0          10s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-vflvx    0/1     Terminating   0          4s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-vft7f    0/1     Terminating   0          6s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-vnbs6    0/1     Terminating   0          13s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-vqjjw    0/1     Terminating   0          1s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-vv2j9    0/1     Terminating   0          8s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-vx5p5    0/1     Terminating   0          3s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-w5qpw    0/1     Terminating   0          12s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-wbf72    0/1     Terminating   0          12s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-wcmm7    0/1     Terminating   0          10s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-wm86v    0/1     Terminating   0          3s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-wmjqz    0/1     Terminating   0          11s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-wp77b    0/1     Terminating   0          13s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-x2kfc    0/1     Terminating   0          5s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-x2wr5    0/1     Terminating   0          13s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-x6zp6    0/1     Terminating   0          7s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-x888m    0/1     Terminating   0          7s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-x8q2c    0/1     Terminating   0          9s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-xbjqw    0/1     Terminating   0          8s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-xxvc9    0/1     Terminating   0          4s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-z49xf    0/1     Terminating   0          4s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-z5sxr    0/1     Terminating   0          3s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-z888g    0/1     Terminating   0          6s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-zbgcd    0/1     Terminating   0          3s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-zclmh    0/1     Terminating   0          8s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-zqb9t    0/1     Terminating   0          8s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-zqftl    0/1     Terminating   0          9s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-zs24c    0/1     Terminating   0          6s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-zvbr7    0/1     Terminating   0          14s   <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/scan-cisbenchmark-78758d4f6d-zx747    0/1     Terminating   0          5s    <none>          kubernetes-dev-1-etcd-1.mgmt.hss.int   <none>           <none>
pod/starboard-operator-548b58dcd9-fqxkv   1/1     Running       0          23s   10.42.151.150   kubernetes-dev-1-node-4.mgmt.hss.int   <none>           <none>

NAME                                     COMPLETIONS   DURATION   AGE   CONTAINERS   IMAGES                               SELECTOR
job.batch/scan-cisbenchmark-78758d4f6d   0/1           15s        15s   kube-bench   docker.io/aquasec/kube-bench:0.5.0   controller-uid=33086cee-c898-446b-aca7-28afb53b4873

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                             SELECTOR
deployment.apps/starboard-operator   1/1     1            1           81s   operator     harbor.hss.int/aquasec/starboard-operator:0.10.3   app=starboard-operator

Growing number of job pods:

9:33:25 0> k get pods -n starboard-operator | wc -l
872
9:33:28 0> k get pods -n starboard-operator | wc -l
903
9:33:37 0> k get pods -n starboard-operator | wc -l
972

Here is a description of a quite young job pod:

9:39:11 0> k describe pod scan-cisbenchmark-78758d4f6d-zm4bl
Name:                      scan-cisbenchmark-78758d4f6d-zm4bl
Namespace:                 starboard-operator
Priority:                  0
Node:                      kubernetes-dev-1-etcd-1.mgmt.hss.int/
Labels:                    app.kubernetes.io/managed-by=starboard
                           controller-uid=b2fc3614-c5ae-4f7b-8140-b31ba4b45129
                           job-name=scan-cisbenchmark-78758d4f6d
                           kubeBenchReport.scanner=true
                           starboard.resource.kind=Node
                           starboard.resource.name=kubernetes-dev-1-etcd-1.mgmt.hss.int
Annotations:               seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:                    Terminating (lasts 60s)
Termination Grace Period:  30s
IP:
IPs:                       <none>
Controlled By:             Job/scan-cisbenchmark-78758d4f6d
Containers:
  kube-bench:
    Image:      docker.io/aquasec/kube-bench:0.5.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
    Args:
      -c
      kube-bench --json 2> /dev/null
    Limits:
      cpu:     300m
      memory:  300M
    Requests:
      cpu:        50m
      memory:     50M
    Environment:  <none>
    Mounts:
      /etc/kubernetes from etc-kubernetes (ro)
      /etc/systemd from etc-systemd (ro)
      /usr/local/mount-from-host/bin from usr-bin (ro)
      /var/lib/etcd from var-lib-etcd (ro)
      /var/lib/kubelet from var-lib-kubelet (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from starboard-operator-token-r4dnp (ro)
Volumes:
  var-lib-etcd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:
  var-lib-kubelet:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:
  etc-systemd:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/systemd
    HostPathType:
  etc-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes
    HostPathType:
  usr-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin
    HostPathType:
  starboard-operator-token-r4dnp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  starboard-operator-token-r4dnp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

Here are the events of a slightly older Terminating job pod:

Events:
  Type     Reason       Age    From     Message
  ----     ------       ----   ----     -------
  Warning  FailedMount  5m13s  kubelet  Unable to attach or mount volumes: unmounted volumes=[var-lib-kubelet etc-systemd etc-kubernetes usr-bin starboard-operator-token-r4dnp var-lib-etcd], unattached volumes=[var-lib-kubelet etc-systemd etc-kubernetes usr-bin starboard-operator-token-r4dnp var-lib-etcd]: timed out waiting for the condition
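
To stop the flood while debugging, one option is to scale the operator down and then delete the failed scan job, which also removes its pods (a sketch only, using the job name from the listing above):

$ kubectl scale deployment/starboard-operator -n starboard-operator --replicas=0
$ kubectl delete job scan-cisbenchmark-78758d4f6d -n starboard-operator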

Timoses avatar Jul 06 '21 07:07 Timoses

Sorry for the delay, @danielpacak, but what @Timoses posted is the same thing we are seeing.

sazo avatar Jul 07 '21 14:07 sazo

For the record, I've spun up an RKE cluster and deployed Starboard Operator v0.10.3, and the CISKubeBenchReport was created without any issues.

$ kubectl get ciskubebenchreports.aquasecurity.github.io -o wide
NAME            SCANNER      AGE   FAIL   WARN   INFO   PASS
ip-10-0-1-245   kube-bench   25m   29     44     0      49
$ kubectl get node -o wide
NAME            STATUS   ROLES                      AGE   VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
ip-10-0-1-245   Ready    controlplane,etcd,worker   32m   v1.20.11   10.0.1.245    <none>        Ubuntu 18.04.5 LTS   5.4.0-1045-aws   docker://20.10.6
$ kubectl get deploy -A -o wide
NAMESPACE            NAME                      READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS                IMAGES                                                                  SELECTOR
cattle-system        cattle-cluster-agent      1/1     1            1           32m     cluster-register          rancher/rancher-agent:v2.5.8                                            app=cattle-cluster-agent
default              nginx                     1/1     1            1           6m12s   nginx                     nginx:1.16                                                              app=nginx
fleet-system         fleet-agent               1/1     1            1           32m     fleet-agent               rancher/fleet-agent:v0.3.5                                              app=fleet-agent
ingress-nginx        default-http-backend      1/1     1            1           33m     default-http-backend      rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1   app=default-http-backend
kube-system          calico-kube-controllers   1/1     1            1           33m     calico-kube-controllers   rancher/mirrored-calico-kube-controllers:v3.17.2                        k8s-app=calico-kube-controllers
kube-system          coredns                   1/1     1            1           33m     coredns                   rancher/mirrored-coredns-coredns:1.8.0                                  k8s-app=kube-dns
kube-system          coredns-autoscaler        1/1     1            1           33m     autoscaler                rancher/mirrored-cluster-proportional-autoscaler:1.8.1                  k8s-app=coredns-autoscaler
kube-system          metrics-server            1/1     1            1           33m     metrics-server            rancher/mirrored-metrics-server:v0.4.1                                  k8s-app=metrics-server
starboard-operator   starboard-operator        1/1     1            1           24m     operator                  docker.io/aquasec/starboard-operator:0.10.3                             app=starboard-operator

danielpacak avatar Oct 12 '21 07:10 danielpacak

After correcting the imageRef of kube-bench, it fails because it is unable to mount various volumes. We deploy with RKE/Rancher, which uses different directories. It still produces a bombardment of pods ;).

@Timoses Can you share your env config or point me to the Rancher docs, so I can better understand the difference in the default directories used by Rancher / K8s components? If the paths are different, we can make them configurable.
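
For example, a quick check run directly on one of the RKE nodes, like the sketch below, would show which of the default kube-bench host paths actually exist there (the paths are the ones mounted by the scan job in the pod description above):

$ for d in /var/lib/etcd /var/lib/kubelet /etc/systemd /etc/kubernetes /usr/bin; do
    [ -d "$d" ] && echo "present: $d" || echo "missing: $d"
  done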

danielpacak avatar Oct 12 '21 08:10 danielpacak