Bombardment of 'scan-cisbenchmark-' pods
What steps did you take and what happened:
- Installation steps from here: https://aquasecurity.github.io/starboard/v0.10.3/operator/installation/kubectl/
- Edited the Starboard deployment to use an image from our private image registry
- Lots of 'scan-cisbenchmark-' pods were generated (over 2600)
- Why? The Starboard configmap is configured by default to use container images from docker.io, which is not reachable from our clusters. Therefore the pods terminated (a sketch of the registry overrides follows this list).
- The jobs are now slowly being removed by the Kubernetes controller.
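For reference, a rough sketch of overriding both images (we had only done the deployment step at this point; the kube-bench.imageRef key in the starboard ConfigMap is assumed from the v0.10.x docs, and harbor.hss.int is our internal registry):
# Point the operator itself at the private registry (container name "operator"
# as in the upstream deployment manifest).
$ kubectl -n starboard-operator set image deployment/starboard-operator \
    operator=harbor.hss.int/aquasec/starboard-operator:0.10.3
# Point the kube-bench scanner at the private registry as well; without this
# the scan jobs still reference docker.io.
$ kubectl -n starboard-operator patch configmap starboard --type merge \
    -p '{"data":{"kube-bench.imageRef":"harbor.hss.int/aquasec/kube-bench:0.5.0"}}'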
What did you expect to happen: The benchmark job should be configured so that it does not bombard the namespace with pods.
Anything else you would like to add:
Environment:
- Starboard version (use starboard version):
Starboard Version: {Version:0.10.3 Commit:5bd33431a239b98be4a3287563b8664a9b3d5707 Date:2021-05-14T12:20:34Z}
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
- OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc): Fedora 33
After correcting the imageRef of kube-bench it fails because it is unable to mount several volumes. We deploy with RKE/Rancher, which uses different directories. It still produces a bombardment of pods ; ).
👋 @Timoses I'm sorry to hear that Starboard is causing such trouble. We do actually have a configurable limit on the number of scan jobs created by the Starboard Operator; the operator programmatically compares the number of active scan jobs to that limit (OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT). We'll try to reproduce and see what is going on.
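For reference, a minimal sketch of tuning that limit on the operator Deployment (the value here is only an example):
$ kubectl -n starboard-operator set env deployment/starboard-operator \
    OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT=3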
Seeing the same thing in our RKE/rancher cluster. Also following https://aquasecurity.github.io/starboard/v0.10.3/operator/installation/kubectl/
k8s-version: 1.19.
@Timoses I started looking into this issue and I don't quite understand what is happening in your cluster:
The Starboard configmap is configured by default to use container images from docker.io, which is not reachable from our clusters. Therefore the pods terminated.
AFAIK, a pod cannot even be started if its image cannot be pulled from a registry, and the pod will get stuck in the ImagePullBackOff or ErrImagePull status. How come they were terminated by the K8s controller manager?
Beyond that, each scan job created to run CIS benchmarks on a K8s node has a deterministic name to make sure that we create only one instance. Any attempt to create scan jobs with the same name would end up with an error returned by the Kubernetes API. Could you share at least a partial listing of the pods that are "bombarding" your cluster, please?
I tried reproducing this issue with a kind cluster by prefixing the kube-bench container image with x, i.e. xdocker.io/aquasec/kube-bench:0.5.0. The operator actually "gets stuck" because it is waiting for job completion or an error, but I still don't see any excessive number of pods being created.
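For context, a sketch of how such a repro can be set up (assuming the kube-bench.imageRef key in the starboard ConfigMap and the default namespace from the installation docs):
$ kind create cluster
$ kubectl -n starboard-operator patch configmap starboard --type merge \
    -p '{"data":{"kube-bench.imageRef":"xdocker.io/aquasec/kube-bench:0.5.0"}}'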
That said, could you share the logs of the operator and a snapshot of the output of the following command in your environment when this problem occurs? I want to see how the scan jobs are named and confirm the statuses of the pods and jobs managed by Starboard.
$ kubectl get pod,job,deploy -n starboard-operator -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/scan-cisbenchmark-845b67c44-2gv9z 0/1 ImagePullBackOff 0 87s 10.244.0.8 kind-control-plane <none> <none>
pod/starboard-operator-746769c64-9fwj5 1/1 Running 0 16m 10.244.0.5 kind-control-plane <none> <none>
NAME COMPLETIONS DURATION AGE CONTAINERS IMAGES SELECTOR
job.batch/scan-cisbenchmark-845b67c44 0/1 87s 87s kube-bench xdocker.io/aquasec/kube-bench:0.5.0 controller-uid=e6165dcc-0986-469f-8084-76af040a6fea
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/starboard-operator 1/1 1 1 16m operator docker.io/aquasec/starboard-operator:0.10.3 app=starboard-operator
Please also elaborate on the RKE/Rancher deployment. Is it open source or commercial? Which version did you use, and what are the precise steps to reproduce, including the number of K8s nodes?
Seeing the same thing in our RKE/rancher cluster. Also following https://aquasecurity.github.io/starboard/v0.10.3/operator/installation/kubectl/ k8s-version: 1.19.
@sazo Could you share the list of nodes and the list of scan jobs that are created in the operator namespace in your cluster, please? If possible, please also share the logs of the operator.
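For reference, something like the following should capture all of that:
$ kubectl get nodes -o wide
$ kubectl get jobs,pods -n starboard-operator -o wide
$ kubectl logs -n starboard-operator deployment/starboard-operator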
After deleting my custom config again and applying the starboard operator with kubectl apply -f https://raw.githubusercontent.com/aquasecurity/starboard/v0.10.3/deploy/static/06-starboard-operator.deployment.yaml (and editing the starboard-operator deployment to use our private registry):
pod/scan-cisbenchmark-78758d4f6d-slxms 0/1 Terminating 0 9s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-sqlcl 0/1 Terminating 0 10s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-sw6v6 0/1 Terminating 0 9s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-sxn79 0/1 Terminating 0 11s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-tm7px 0/1 Terminating 0 5s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-tnkkn 0/1 Terminating 0 4s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-trtpg 0/1 Terminating 0 13s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-tvqwf 0/1 Terminating 0 7s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-v9627 0/1 Terminating 0 6s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-v9chb 0/1 Terminating 0 10s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-vflvx 0/1 Terminating 0 4s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-vft7f 0/1 Terminating 0 6s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-vnbs6 0/1 Terminating 0 13s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-vqjjw 0/1 Terminating 0 1s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-vv2j9 0/1 Terminating 0 8s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-vx5p5 0/1 Terminating 0 3s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-w5qpw 0/1 Terminating 0 12s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-wbf72 0/1 Terminating 0 12s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-wcmm7 0/1 Terminating 0 10s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-wm86v 0/1 Terminating 0 3s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-wmjqz 0/1 Terminating 0 11s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-wp77b 0/1 Terminating 0 13s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-x2kfc 0/1 Terminating 0 5s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-x2wr5 0/1 Terminating 0 13s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-x6zp6 0/1 Terminating 0 7s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-x888m 0/1 Terminating 0 7s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-x8q2c 0/1 Terminating 0 9s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-xbjqw 0/1 Terminating 0 8s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-xxvc9 0/1 Terminating 0 4s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-z49xf 0/1 Terminating 0 4s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-z5sxr 0/1 Terminating 0 3s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-z888g 0/1 Terminating 0 6s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-zbgcd 0/1 Terminating 0 3s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-zclmh 0/1 Terminating 0 8s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-zqb9t 0/1 Terminating 0 8s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-zqftl 0/1 Terminating 0 9s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-zs24c 0/1 Terminating 0 6s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-zvbr7 0/1 Terminating 0 14s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/scan-cisbenchmark-78758d4f6d-zx747 0/1 Terminating 0 5s <none> kubernetes-dev-1-etcd-1.mgmt.hss.int <none> <none>
pod/starboard-operator-548b58dcd9-fqxkv 1/1 Running 0 23s 10.42.151.150 kubernetes-dev-1-node-4.mgmt.hss.int <none> <none>
NAME COMPLETIONS DURATION AGE CONTAINERS IMAGES SELECTOR
job.batch/scan-cisbenchmark-78758d4f6d 0/1 15s 15s kube-bench docker.io/aquasec/kube-bench:0.5.0 controller-uid=33086cee-c898-446b-aca7-28afb53b4873
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/starboard-operator 1/1 1 1 81s operator harbor.hss.int/aquasec/starboard-operator:0.10.3 app=starboard-operator
Growing number of job pods:
9:33:25 0> k get pods -n starboard-operator | wc -l
872
9:33:28 0> k get pods -n starboard-operator | wc -l
903
9:33:37 0> k get pods -n starboard-operator | wc -l
972
Here is a description of a quite young job pod:
9:39:11 0> k describe pod scan-cisbenchmark-78758d4f6d-zm4bl
Name: scan-cisbenchmark-78758d4f6d-zm4bl
Namespace: starboard-operator
Priority: 0
Node: kubernetes-dev-1-etcd-1.mgmt.hss.int/
Labels: app.kubernetes.io/managed-by=starboard
controller-uid=b2fc3614-c5ae-4f7b-8140-b31ba4b45129
job-name=scan-cisbenchmark-78758d4f6d
kubeBenchReport.scanner=true
starboard.resource.kind=Node
starboard.resource.name=kubernetes-dev-1-etcd-1.mgmt.hss.int
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Terminating (lasts 60s)
Termination Grace Period: 30s
IP:
IPs: <none>
Controlled By: Job/scan-cisbenchmark-78758d4f6d
Containers:
kube-bench:
Image: docker.io/aquasec/kube-bench:0.5.0
Port: <none>
Host Port: <none>
Command:
sh
Args:
-c
kube-bench --json 2> /dev/null
Limits:
cpu: 300m
memory: 300M
Requests:
cpu: 50m
memory: 50M
Environment: <none>
Mounts:
/etc/kubernetes from etc-kubernetes (ro)
/etc/systemd from etc-systemd (ro)
/usr/local/mount-from-host/bin from usr-bin (ro)
/var/lib/etcd from var-lib-etcd (ro)
/var/lib/kubelet from var-lib-kubelet (ro)
/var/run/secrets/kubernetes.io/serviceaccount from starboard-operator-token-r4dnp (ro)
Volumes:
var-lib-etcd:
Type: HostPath (bare host directory volume)
Path: /var/lib/etcd
HostPathType:
var-lib-kubelet:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType:
etc-systemd:
Type: HostPath (bare host directory volume)
Path: /etc/systemd
HostPathType:
etc-kubernetes:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes
HostPathType:
usr-bin:
Type: HostPath (bare host directory volume)
Path: /usr/bin
HostPathType:
starboard-operator-token-r4dnp:
Type: Secret (a volume populated by a Secret)
SecretName: starboard-operator-token-r4dnp
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Here are the events of a slightly older Terminating job pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 5m13s kubelet Unable to attach or mount volumes: unmounted volumes=[var-lib-kubelet etc-systemd etc-kubernetes usr-bin starboard-operator-token-r4dnp var-lib-etcd], unattached volumes=[var-lib-kubelet etc-systemd etc-kubernetes usr-bin starboard-operator-token-r4dnp var-lib-etcd]: timed out waiting for the condition
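One way to sanity-check that FailedMount event is to confirm on the RKE node itself that the hostPath directories from the pod spec above exist (paths taken from the Mounts/Volumes sections; this is only a suggested check):
$ ls -ld /etc/kubernetes /etc/systemd /usr/bin /var/lib/etcd /var/lib/kubelet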
Sorry for the delay, @danielpacak. But what @Timoses posted is the same thing we are seeing.
For the record, I've spun up an RKE cluster and deployed Starboard Operator v0.10.3, and the CISKubeBenchReport was created without any issues.
$ kubectl get ciskubebenchreports.aquasecurity.github.io -o wide
NAME SCANNER AGE FAIL WARN INFO PASS
ip-10-0-1-245 kube-bench 25m 29 44 0 49
$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-1-245 Ready controlplane,etcd,worker 32m v1.20.11 10.0.1.245 <none> Ubuntu 18.04.5 LTS 5.4.0-1045-aws docker://20.10.6
$ kubectl get deploy -A -o wide
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
cattle-system cattle-cluster-agent 1/1 1 1 32m cluster-register rancher/rancher-agent:v2.5.8 app=cattle-cluster-agent
default nginx 1/1 1 1 6m12s nginx nginx:1.16 app=nginx
fleet-system fleet-agent 1/1 1 1 32m fleet-agent rancher/fleet-agent:v0.3.5 app=fleet-agent
ingress-nginx default-http-backend 1/1 1 1 33m default-http-backend rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1 app=default-http-backend
kube-system calico-kube-controllers 1/1 1 1 33m calico-kube-controllers rancher/mirrored-calico-kube-controllers:v3.17.2 k8s-app=calico-kube-controllers
kube-system coredns 1/1 1 1 33m coredns rancher/mirrored-coredns-coredns:1.8.0 k8s-app=kube-dns
kube-system coredns-autoscaler 1/1 1 1 33m autoscaler rancher/mirrored-cluster-proportional-autoscaler:1.8.1 k8s-app=coredns-autoscaler
kube-system metrics-server 1/1 1 1 33m metrics-server rancher/mirrored-metrics-server:v0.4.1 k8s-app=metrics-server
starboard-operator starboard-operator 1/1 1 1 24m operator docker.io/aquasec/starboard-operator:0.10.3 app=starboard-operator
After correcting the imageRef of kube-bench it fails because it is unable to mount several volumes. We deploy with RKE/Rancher, which uses different directories. It still produces a bombardment of pods ; ).
@Timoses Can you share your env config or point me to the Rancher docs to better understand the difference in default directories used by Rancher / K8s components? If the paths are different, we can make them configurable.
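Purely as a hypothetical illustration of what making them configurable could look like (these keys do not exist in v0.10.3), per-volume host paths might be overridden in the starboard ConfigMap:
$ kubectl -n starboard-operator patch configmap starboard --type merge \
    -p '{"data":{"kube-bench.volumes.var-lib-etcd":"/path/used/by/rke"}}'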