
Store violations as instances of new CRD type violations.gatekeeper.sh

Open RamyasreeChakka opened this issue 5 years ago • 4 comments

Current solution and problems: There are currently two places to get the details of violations identified by Gatekeeper:

  1. Violations are available in the status subresource of a constraint.
  2. Violations are available in the Gatekeeper pod logs.

The number of violations that can be stored in the status subresource of a constraint is limited by the etcd object size limit of 1MB. For large clusters (e.g. 10K pods), it's not possible to store all violations on the constraint resource.

The violations stored in the Gatekeeper pod logs are not persistent: the logs are lost if the Gatekeeper pod is evicted or restarted, and they rotate after a certain size (10MB). Logs are also not user friendly and require extra processing on the consumer side to get the latest snapshot of violations in the cluster. In the future, if Gatekeeper moves to a distributed audit model (multiple audit pods, each handling audit of part of the cluster), a consumer would have to query all of the audit pods to find violations.

Describe the solution you'd like: Create a new CRD type 'violations.gatekeeper.sh'.

  1. Store violations as instances of this CRD. For example, if a constraint evaluation results in 10 violations, 10 violation CRD instances would be written to the cluster.
  2. Query violations using labels. The constraint name, pod name, and namespace name can be added as labels on each violation, and violations can then be queried by any of these labels (see the sketch after this list).
  3. Query violations in chunks. In a large cluster with many violations, a consumer can read them chunk by chunk by specifying a list size limit on the API server query.
  4. Violations are persisted in etcd, so a consumer can get a complete snapshot of the cluster's violations at any point in time.
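To make the shape of the proposal concrete, here is a minimal sketch of what one such violation instance might look like. The group, kind, field names, and labels below are all hypothetical and are not part of any existing Gatekeeper API:

```yaml
# Hypothetical example only: the group/kind and field names are illustrative
# and do not exist in Gatekeeper today.
apiVersion: violations.gatekeeper.sh/v1alpha1
kind: Violation
metadata:
  name: ns-must-have-gk-7f3a2
  namespace: gatekeeper-system
  labels:
    # Labels enable querying with ordinary label selectors, e.g.
    #   kubectl get violations -l constraint=ns-must-have-gk
    constraint: ns-must-have-gk
    violating-namespace: production
    violating-name: nginx-deployment
spec:
  constraintKind: K8sRequiredLabels
  constraintName: ns-must-have-gk
  enforcementAction: deny
  resource:
    apiVersion: apps/v1
    kind: Deployment
    namespace: production
    name: nginx-deployment
  message: 'you must provide labels: {"gatekeeper"}'
```

Chunked reads then come for free from the API server's standard list pagination (a list size limit plus a continue token on subsequent requests).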

Anything else you would like to add:

  1. Optimization: If a CRD instance is created for every violation, it would put a lot of pressure on the API server. One optimization is to store a batch of violations in a single CRD instance. For example, 1000 violations could be stored in batches of 100, so that only 10 instances are created and each instance holds 100 violations (see the sketch after this list).
  2. Violation deletion: Like Kubernetes events, violations reported more than 1 hour ago could be deleted to remove stale data. Gatekeeper would need extra logic for deleting stale violation data.
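A rough sketch of that batching idea, again with purely hypothetical group, kind, and field names:

```yaml
# Hypothetical batched form: one object carries many violations, cutting the
# number of API server writes per audit run at the cost of larger objects.
apiVersion: violations.gatekeeper.sh/v1alpha1
kind: ViolationBatch
metadata:
  name: ns-must-have-gk-batch-0
  namespace: gatekeeper-system
  labels:
    constraint: ns-must-have-gk
    batch: "0"
spec:
  constraintKind: K8sRequiredLabels
  constraintName: ns-must-have-gk
  violations:   # up to the configured batch size, e.g. 100 entries
    - resource: {kind: Deployment, namespace: production, name: nginx-deployment}
      message: 'you must provide labels: {"gatekeeper"}'
    - resource: {kind: Deployment, namespace: staging, name: redis}
      message: 'you must provide labels: {"gatekeeper"}'
```

The batch size is a trade-off: larger batches mean fewer writes, but each object still has to stay well under the etcd size limit discussed above.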

Environment:

  • Gatekeeper version: beta8
  • Kubernetes version (kubectl version): 1.14

RamyasreeChakka avatar Jul 01 '20 07:07 RamyasreeChakka

Might be relevant: https://github.com/kubernetes-sigs/wg-policy-prototypes/blob/master/policy-report/README.md

rficcaglia avatar Jul 08 '20 16:07 rficcaglia

Also, the etcd 1MB limitation came up in today's WG discussion. @ritazh presented a nice matrix of the various concerns - maybe she can link it here?

rficcaglia avatar Jul 08 '20 16:07 rficcaglia

Re-evaluate whether wg-policy's policy-report has addressed the scalability issues.

sozercan avatar Apr 13 '22 16:04 sozercan

There's now a v1beta1 of the PolicyReport CRD that defines some configuration to help with scalability. Please take a look and see if it addresses the concerns here: https://github.com/kubernetes-sigs/wg-policy-prototypes/blob/22764d64b0c3f79d2293b67000b4d9ebca197623/policy-report/crd/v1beta1/wgpolicyk8s.io_policyreports.yaml#L714-L743
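For orientation, here is a rough sketch of how a Gatekeeper violation might be expressed as a PolicyReport result. The field names follow the wgpolicyk8s.io v1alpha2 shape as I understand it; the v1beta1 schema linked above may differ:

```yaml
# Illustrative only: field names are based on the wgpolicyk8s.io/v1alpha2
# PolicyReport shape and may not match the linked v1beta1 schema exactly.
apiVersion: wgpolicyk8s.io/v1alpha2
kind: PolicyReport
metadata:
  name: gatekeeper-ns-must-have-gk
  namespace: production
summary:
  pass: 0
  fail: 1
results:
  - policy: ns-must-have-gk          # constraint name
    rule: K8sRequiredLabels          # constraint kind
    result: fail
    message: 'you must provide labels: {"gatekeeper"}'
    resources:
      - apiVersion: apps/v1
        kind: Deployment
        namespace: production
        name: nginx-deployment
```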

gparvin avatar Oct 09 '23 14:10 gparvin