gatekeeper continuously updating the constraint templates on openshift cluster

Open goyalv opened this issue 4 years ago • 16 comments

What steps did you take and what happened: Deployed a Gatekeeper release using the prebuilt image with the following command:

    kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/deploy/gatekeeper.yaml

This ran into issue #790, where the gatekeeper pod was not coming up. As a workaround, I edited the clusterrole "gatekeeper-manager-role" to add a rule:

    - apiGroups:
      - security.openshift.io
      resourceNames:
      - nonroot
      resources:
      - securitycontextconstraints
      verbs:
      - use

Edited the SCC "nonroot" to add:

    seccompProfiles:
    - runtime/default

With the above workaround the gatekeeper pods came up and started running.

I then deployed a constraint template and saw that it was continuously being updated; the only difference between two updates was the resource version.

Logs from the gatekeeper-controller-manager pod that kept repeating:

"level":"info","ts":1597706603.8988822,"logger":"controller","msg":"loading code into OPA","kind":"ConstraintTemplate","process":"constraint_template_controller","name":"try-system-block-privilege-escalation-v1","crdName":"try-system-block-privilege-escalation-v1.constraints.gatekeeper.sh"} {"level":"info","ts":1597706603.9059987,"logger":"controller","msg":"[readiness] observed ConstraintTemplate","kind":"ConstraintTemplate","process":"constraint_template_controller","name":"try-system-block-privilege-escalation-v1","crdName":"try-system-block-privilege-escalation-v1.constraints.gatekeeper.sh","name":"try-system-block-privilege-escalation-v1"} {"level":"info","ts":1597706603.906382,"logger":"controller","msg":"updating crd","kind":"ConstraintTemplate","process":"constraint_template_controller","name":"try-system-block-privilege-escalation-v1","crdName":"try-system-block-privilege-escalation-v1.constraints.gatekeeper.sh"} {"level":"info","ts":1597706603.914981,"logger":"controller","msg":"making sure constraint is in watcher registry","kind":"ConstraintTemplate","process":"constraint_template_controller","name":"try-system-block-privilege-escalation-v1","crdName":"try-system-block-privilege-escalation-v1.constraints.gatekeeper.sh"} {"level":"info","ts":1597706603.9379447,"logger":"controller","msg":"template was updated","kind":"ConstraintTemplate","process":"constraint_template_controller","event_type":"template_updated","template_name":"try-system-block-privilege-escalation-v1"}

What did you expect to happen: Constraint template gets created and is updated by audit only after a certain interval.

Environment: openshift cluster

  • Gatekeeper version: openpolicyagent/gatekeeper:v3.1.0-rc.1
  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T18:55:03Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2020-07-21T11:45:09Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

goyalv avatar Aug 18 '20 18:08 goyalv

My first guess is that there is some sort of coercion happening on the API server that Gatekeeper detects as drift. Maybe setting of default values on the CRD?

Would it be possible to add some log statements printing out the spec of the CRD that Gatekeeper thinks it's applying from the reconciler and comparing it to the actual CRD as stored on the API server?
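
Something along these lines would do, purely as a sketch (the crddebug package and logSpecDrift helper below are made-up names, not existing Gatekeeper code):

    package crddebug

    import (
        "encoding/json"
        "log"

        apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    )

    // logSpecDrift dumps the spec Gatekeeper wants to apply next to the spec
    // currently stored on the API server, so any coercion or defaulting shows up.
    func logSpecDrift(proposed, current *apiextensionsv1beta1.CustomResourceDefinition) {
        proposedJSON, err := json.MarshalIndent(proposed.Spec, "", "  ")
        if err != nil {
            log.Printf("marshal proposed spec: %v", err)
            return
        }
        currentJSON, err := json.MarshalIndent(current.Spec, "", "  ")
        if err != nil {
            log.Printf("marshal current spec: %v", err)
            return
        }
        log.Printf("proposed CRD spec:\n%s", proposedJSON)
        log.Printf("current CRD spec on the API server:\n%s", currentJSON)
    }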

maxsmythe avatar Aug 18 '20 19:08 maxsmythe

I think that would be beneficial. Also, looking at the code, I see

	newCRD = currentCRD.DeepCopy()
	newCRD.Spec = proposedCRD.Spec

and then a check of the form if !reflect.DeepEqual(newCRD, currentCRD).

In the above, instead of a shallow copy of Spec, should we do a deep copy of Spec?
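
Something like this, purely as a sketch of the idea (not a tested patch; the needsUpdate wrapper is made up, but CustomResourceDefinitionSpec does have a generated DeepCopy method):

    package crddebug

    import (
        "reflect"

        apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    )

    // needsUpdate deep-copies the proposed Spec instead of assigning it, so that
    // newCRD shares no pointers or maps with proposedCRD, then compares against
    // the CRD currently stored on the API server.
    func needsUpdate(currentCRD, proposedCRD *apiextensionsv1beta1.CustomResourceDefinition) (*apiextensionsv1beta1.CustomResourceDefinition, bool) {
        newCRD := currentCRD.DeepCopy()
        newCRD.Spec = *proposedCRD.Spec.DeepCopy()
        return newCRD, !reflect.DeepEqual(newCRD, currentCRD)
    }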

I am not sure why this issue only manifests on an OpenShift cluster; I tested Gatekeeper on GCP, Azure, AWS, and kind but did not observe the issue on any of them.

goyalv avatar Aug 18 '20 19:08 goyalv

I wouldn't be surprised if the issue was something like defaulting.

The DeepCopy() thing is worth a try, though I'd be surprised if it fixed anything as there don't appear to be any lines modifying the spec between the assignment and the DeepEqual, and multithreading shouldn't be a concern within the reconciler itself.

You could use cmp.Diff from https://github.com/google/go-cmp to see exactly what diffs are causing DeepEqual() to return false.
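
For example, something like this sketch (the crddebug package and logCRDDiff helper are hypothetical, not existing Gatekeeper code):

    package crddebug

    import (
        "log"

        "github.com/google/go-cmp/cmp"
        apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    )

    // logCRDDiff prints exactly which fields differ between the CRD currently on
    // the API server and the one the reconciler is about to apply.
    func logCRDDiff(currentCRD, newCRD *apiextensionsv1beta1.CustomResourceDefinition) {
        if diff := cmp.Diff(currentCRD, newCRD); diff != "" {
            log.Printf("CRD drift (-current +new):\n%s", diff)
        }
    }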

maxsmythe avatar Aug 18 '20 20:08 maxsmythe

Tried on a 4.6 cluster, not seeing this happening.

Curiously enough, I am seeing 3 iterations of the webhook logs updating the CR before it stabilizes.

In all three, the difference is the following (pardon the tabs):

  \t\t\t\t\t\t\t\t\t\"labels\": {
  \t\t\t\t\t\t\t\t\t\t... // 21 identical fields
  \t\t\t\t\t\t\t\t\t\tMinProperties: nil,
  \t\t\t\t\t\t\t\t\t\tRequired:      nil,
- \t\t\t\t\t\t\t\t\t\tItems:         s\"&JSONSchemaPropsOrArray{Schema:nil,JSONSchemas:[]JSONSchemaProps{},}\",
+ \t\t\t\t\t\t\t\t\t\tItems:         nil,

The left one is the proposed CRD and the right one is the old one. I don't think this is relevant though, because I am seeing this happening with vanilla kind too. I wonder if it depends on the k8s version.

Will try to get my hands on a 3.x cluster but it's not that easy.

fedepaol avatar Sep 16 '20 17:09 fedepaol

Interesting. It looks like Items: is part of the JSONSchema for an array. It looks like the diff is between a nil pointer and an empty object. Any way to tell where in the schema this diff is occurring?

This could be caused by the data loss when round tripping between the Golang struct, marshaled JSON and back again. This would lead to the controller always sending an update every time it gets invoked.
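
One rough way to test that hypothesis would be a round-trip check along these lines (hypothetical helper; the roundTripsCleanly name and crddebug package are made up):

    package crddebug

    import (
        "encoding/json"
        "reflect"

        apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    )

    // roundTripsCleanly marshals a CRD spec to JSON, unmarshals it back, and
    // reports whether anything (e.g. a nil vs. empty Items pointer) was lost.
    func roundTripsCleanly(spec apiextensionsv1beta1.CustomResourceDefinitionSpec) (bool, error) {
        raw, err := json.Marshal(spec)
        if err != nil {
            return false, err
        }
        var back apiextensionsv1beta1.CustomResourceDefinitionSpec
        if err := json.Unmarshal(raw, &back); err != nil {
            return false, err
        }
        return reflect.DeepEqual(spec, back), nil
    }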

3 iterations makes sense to me for this, as that seems like a reasonable number of times for the CRD controller to update the status of the CRD (I know it does so at least twice), plus one event to actually create the object.

It's not great that we are sending unnecessary updates, as it wastes network, but I'm guessing it wouldn't be the root cause.

maxsmythe avatar Sep 17 '20 00:09 maxsmythe

Actually, defaulting is more likely than round-trip errors.

If we are interested in solving the 3-update problem, we could use a semantic comparator for the JSONSchema that understands that nil and an empty object mean the same thing. This is likely more robust than trying to apply defaults using a library, as there is no guarantee the defaults won't change between versions, though there are guarantees that the meaning won't change (per the K8s API design guidelines).
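
To make that concrete, a sketch of such a comparator built from go-cmp options (hypothetical code: it only covers the nil-vs-empty Items case from the diff above, and a complete comparator would need similar rules for other optional schema fields):

    package crddebug

    import (
        "github.com/google/go-cmp/cmp"
        apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    )

    // isEmptyItems treats a nil *JSONSchemaPropsOrArray and one with no schema
    // and no entries as meaning the same thing.
    func isEmptyItems(i *apiextensionsv1beta1.JSONSchemaPropsOrArray) bool {
        return i == nil || (i.Schema == nil && len(i.JSONSchemas) == 0)
    }

    // itemsSemantics applies only when at least one side is "empty"; otherwise
    // cmp falls back to its normal structural comparison.
    var itemsSemantics = cmp.FilterValues(
        func(a, b *apiextensionsv1beta1.JSONSchemaPropsOrArray) bool {
            return isEmptyItems(a) || isEmptyItems(b)
        },
        cmp.Comparer(func(a, b *apiextensionsv1beta1.JSONSchemaPropsOrArray) bool {
            return isEmptyItems(a) && isEmptyItems(b)
        }),
    )

    // crdSpecsEqual reports whether two CRD specs are semantically identical.
    func crdSpecsEqual(a, b apiextensionsv1beta1.CustomResourceDefinitionSpec) bool {
        return cmp.Equal(a, b, itemsSemantics)
    }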

I'm not sure if a semantic comparator like that exists, but it should be possible to write one. Fixing it also probably won't address whatever is causing this bug, so I'll split this into a second issue.

maxsmythe avatar Sep 17 '20 00:09 maxsmythe

> Interesting. It looks like Items: is part of the JSONSchema for an array. It looks like the diff is between a nil pointer and an empty object. Any way to tell where in the schema this diff is occurring?

Not sure I got your question right, but I believe it's in the labels.

> This could be caused by the data loss when round tripping between the Golang struct, marshaled JSON and back again. This would lead to the controller always sending an update every time it gets invoked.

> 3 iterations makes sense to me for this, as that seems like a reasonable number of times for the CRD controller to update the status of the CRD (I know it does so at least twice), plus one event to actually create the object.

> It's not great that we are sending unnecessary updates, as it wastes network, but I'm guessing it wouldn't be the root cause.

Yes, the thing I don't understand is why it stops after 3 iterations.

fedepaol avatar Sep 17 '20 07:09 fedepaol

The path seems to be spec.Validation.Properties["spec"].Properties["parameters"].Properties["labels"]:

  \tSpec: v1beta1.CustomResourceDefinitionSpec{
  \t\t... // 2 identical fields
  \t\tNames: v1beta1.CustomResourceDefinitionNames{Plural: \"k8srequiredlabels\", Singular: \"k8srequiredlabels\", Kind: \"K8sRequiredLabels\", ListKind: \"K8sRequiredLabelsList\", Categories: []string{\"constraint\"}},
  \t\tScope: \"Cluster\",
  \t\tValidation: &v1beta1.CustomResourceValidation{
  \t\t\tOpenAPIV3Schema: &v1beta1.JSONSchemaProps{
  \t\t\t\t... // 26 identical fields
  \t\t\t\tAnyOf: nil,
  \t\t\t\tNot:   nil,
  \t\t\t\tProperties: map[string]v1beta1.JSONSchemaProps{
  \t\t\t\t\t\"metadata\": {Properties: map[string]v1beta1.JSONSchemaProps{\"name\": {Type: \"string\", MaxLength: &63}}},
  \t\t\t\t\t\"spec\": {
  \t\t\t\t\t\t... // 26 identical fields
  \t\t\t\t\t\tAnyOf: nil,
  \t\t\t\t\t\tNot:   nil,
  \t\t\t\t\t\tProperties: map[string]v1beta1.JSONSchemaProps{
  \t\t\t\t\t\t\t\"enforcementAction\": {Type: \"string\"},
  \t\t\t\t\t\t\t\"match\":             {Properties: map[string]v1beta1.JSONSchemaProps{\"excludedNamespaces\": {Type: \"array\", Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Type: \"string\"}}}, \"kinds\": {Type: \"array\", Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Properties: map[string]v1beta1.JSONSchemaProps{\"apiGroups\": {Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Type: \"string\"}}}, \"kinds\": {Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Type: \"string\"}}}}}}}, \"labelSelector\": {Properties: map[string]v1beta1.JSONSchemaProps{\"matchExpressions\": {Type: \"array\", Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Properties: map[string]v1beta1.JSONSchemaProps{\"key\": {Type: \"string\"}, \"operator\": {Type: \"string\", Enum: []v1beta1.JSON{{Raw: []uint8{0x22, 0x49, 0x6e, 0x22}}, {Raw: []uint8{0x22, 0x4e, 0x6f, 0x74, 0x49, 0x6e, 0x22}}, {Raw: []uint8{0x22, 0x45, 0x78, 0x69, 0x73, 0x74, 0x73, 0x22}}, {Raw: []uint8{0x22, 0x44, 0x6f, 0x65, 0x73, 0x4e, 0x6f, 0x74, 0x45, 0x78, 0x69, 0x73, 0x74, 0x22}}}}, \"values\": {Type: \"array\", Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Type: \"string\"}}}}}}}}}, \"namespaceSelector\": {Properties: map[string]v1beta1.JSONSchemaProps{\"matchExpressions\": {Type: \"array\", Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Properties: map[string]v1beta1.JSONSchemaProps{\"key\": {Type: \"string\"}, \"operator\": {Type: \"string\", Enum: []v1beta1.JSON{{Raw: []uint8{0x22, 0x49, 0x6e, 0x22}}, {Raw: []uint8{0x22, 0x4e, 0x6f, 0x74, 0x49, 0x6e, 0x22}}, {Raw: []uint8{0x22, 0x45, 0x78, 0x69, 0x73, 0x74, 0x73, 0x22}}, {Raw: []uint8{0x22, 0x44, 0x6f, 0x65, 0x73, 0x4e, 0x6f, 0x74, 0x45, 0x78, 0x69, 0x73, 0x74, 0x22}}}}, \"values\": {Type: \"array\", Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Type: \"string\"}}}}}}}}}, \"namespaces\": {Type: \"array\", Items: &v1beta1.JSONSchemaPropsOrArray{Schema: &v1beta1.JSONSchemaProps{Type: \"string\"}}}, \"scope\": {Type: \"string\", Enum: []v1beta1.JSON{{Raw: []uint8{0x22, 0x2a, 0x22}}, {Raw: []uint8{0x22, 0x43, 0x6c, 0x75, 0x73, 0x74, 0x65, 0x72, 0x22}}, {Raw: []uint8{0x22, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0x64, 0x22}}}}}},
  \t\t\t\t\t\t\t\"parameters\": {
  \t\t\t\t\t\t\t\t... // 26 identical fields
  \t\t\t\t\t\t\t\tAnyOf: nil,
  \t\t\t\t\t\t\t\tNot:   nil,
  \t\t\t\t\t\t\t\tProperties: map[string]v1beta1.JSONSchemaProps{
  \t\t\t\t\t\t\t\t\t\"labels\": {
  \t\t\t\t\t\t\t\t\t\t... // 21 identical fields
  \t\t\t\t\t\t\t\t\t\tMinProperties: nil,
  \t\t\t\t\t\t\t\t\t\tRequired:      nil,
- \t\t\t\t\t\t\t\t\t\tItems:         s\"&JSONSchemaPropsOrArray{Schema:nil,JSONSchemas:[]JSONSchemaProps{},}\",
+ \t\t\t\t\t\t\t\t\t\tItems:         nil,
  \t\t\t\t\t\t\t\t\t\tAllOf:         nil,
  \t\t\t\t\t\t\t\t\t\tOneOf:         nil,

Still digging

fedepaol avatar Sep 17 '20 12:09 fedepaol

Thank you for finding the problem that converges. Were you able to reproduce the misbehavior described in the initial issue?

maxsmythe avatar Sep 18 '20 00:09 maxsmythe

This is turning into a rabbit hole. So, I somewhat managed to get hold of an OCP 3.11 cluster:

oc version
Client Version: v4.2.0
Kubernetes Version: v1.11.0+d4cacc0

The kubernetes version is tied to the cluster version. The CRD is failing because of:

{"level":"error","ts":1600429782.4145458,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"constrainttemplate-controller","request":"/k8srequiredlabels","error":"Internal error occurred: no kind \"CustomResourceDefinition\" is registered for version \"apiextensions.k8s.io/v1beta1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:29\

There is a knowledge base article that says this is caused by the presence of validating webhooks (which are not supported at that version). Removing the webhook did the trick and resulted in 6 updates (which still is not infinite).

The diffs I am seeing now are:

&v1beta1.CustomResourceDefinition{
  \tTypeMeta:   v1.TypeMeta{Kind: \"CustomResourceDefinition\", APIVersion: \"apiextensions.k8s.io/v1beta1\"},
  \tObjectMeta: v1.ObjectMeta{Name: \"k8srequiredlabels.constraints.gatekeeper.sh\", SelfLink: \"/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/k8srequiredlabels.constraints.gatekeeper.sh\", UID: \"5f02710a-f9a6-11ea-9495-525500d15501\", ResourceVersion: \"8913\", Generation: 1, CreationTimestamp: v1.Time{Time: s\"2020-09-18 11:59:14 +0000 UTC\"}, OwnerReferences: []v1.OwnerReference{{APIVersion: \"templates.gatekeeper.sh/v1beta1\", Kind: \"ConstraintTemplate\", Name: \"k8srequiredlabels\", UID: \"5efc2926-f9a6-11ea-9495-525500d15501\", Controller: &true, BlockOwnerDeletion: &true}}},
  \tSpec: v1beta1.CustomResourceDefinitionSpec{
  \t\t... // 5 identical fields
  \t\tSubresources:             &v1beta1.CustomResourceSubresources{Status: &v1beta1.CustomResourceSubresourceStatus{}},
  \t\tVersions:                 []v1beta1.CustomResourceDefinitionVersion{{Name: \"v1beta1\", Served: true, Storage: true}, {Name: \"v1alpha1\", Served: true}},
- \t\tAdditionalPrinterColumns: nil,
+ \t\tAdditionalPrinterColumns: []v1beta1.CustomResourceColumnDefinition{
+ \t\t\t{
+ \t\t\t\tName:        \"Age\",
+ \t\t\t\tType:        \"date\",
+ \t\t\t\tDescription: \"CreationTimestamp is a timestamp representing the server time when this object was created. It is not guaranteed to be set in happens-before order across separate operations. Clients may not set this value. It is represented in RFC3339 form and is in UTC.\
\
Populated by the system. Read-only. Null for lists. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata\",
+ \t\t\t\tJSONPath:    \".metadata.creationTimestamp\",
+ \t\t\t},
+ \t\t},
- \t\tConversion:            s\"&CustomResourceConversion{Strategy:None,WebhookClientConfig:nil,ConversionReviewVersions:[],}\",
+ \t\tConversion:            nil,
- \t\tPreserveUnknownFields: &true,
+ \t\tPreserveUnknownFields: nil,
  \t},
  \tStatus: v1beta1.CustomResourceDefinitionStatus{Conditions: []v1beta1.CustomResourceDefinitionCondition{{Type: \"NamesAccepted\", Status: \"True\", LastTransitionTime: v1.Time{Time: s\"2020-09-18 11:59:14 +0000 UTC\"}, Reason: \"NoConflicts\", Message: \"no conflicts found\"}, {Type: \"Established\", Status: \"True\", Reason: \"InitialNamesAccepted\", Message: \"the initial names have been accepted\"}}, AcceptedNames: v1beta1.CustomResourceDefinitionNames{Plural: \"k8srequiredlabels\", Singular: \"k8srequiredlabels\", Kind: \"K8sRequiredLabels\", ListKind: \"K8sRequiredLabelsList\", Categories: []string{\"constraint\"}}, StoredVersions: []string{\"v1beta1\"}},
  }
"}

The timestamp one seems pretty tricky. In any case, I am not seeing a full loop. The related k8s version is pretty old and I bet that is the root cause (as opposed to the platform being OpenShift), so I am not sure how to treat this.

At the same time, I am curious to see why updates on the CRD are not triggering reconciliation loops. Will follow up on the other issue.

fedepaol avatar Sep 18 '20 12:09 fedepaol

Just to add that, despite the "updating" logs, the CRD generation does not change, which makes me think that even in this case the updates are generated by status updates.

fedepaol avatar Sep 18 '20 15:09 fedepaol

generation is handled by the API server. I think it's possible for the API server to realize there is zero change after defaulting is applied and discard the updates, avoiding incrementing the generation but I'm not 100% sure.

Thank you for trying to reproduce the behavior. Unless @goyalv is able to reproduce this bug, maybe we should close this as obsolete?

maxsmythe avatar Sep 19 '20 02:09 maxsmythe

Hi, I tried to repro the issue on a 4.5.14 OpenShift cluster and did not see it happening; there were no multiple updates. However, I was able to repro it on an OpenShift 3.11 cluster brought up using minishift:

oc version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth

Thanks for looking into this.

On the OpenShift 3 cluster, when I run "oc get validatingwebhookconfigurations" I only get gatekeeper-validating-webhook-configuration; deleting this webhook still leaves the constrainttemplates continuously updating.

I also did not see the CRD failure logs in the gatekeeper pod logs. I am unable to conclude what's causing this issue.

goyalv avatar Nov 17 '20 05:11 goyalv

It looks like OpenShift v3.11 is based on k8s 1.11.x, whereas Gatekeeper lists a minimum K8s version of 1.14. Given that it works on OpenShift releases based on newer clusters, is this a support gap we want to bridge? K8s 1.11.x is fairly old at this point.

maxsmythe avatar Nov 19 '20 23:11 maxsmythe

I ran into this as well.

k8s: 1.16.14
gatekeeper: 3.6.0

However, it turns out this was caused by us having 2 gatekeeper-audit pods running in different namespaces (we want to run integration-tests and non-functional-tests in the same cluster, but different namespaces).

As the constrainttemplate is global, both gatekeeper-audit pods received the 'add template' event, tried to sync their cache, and pushed an updated template. The two pods got into a perpetual cycle of receive event -> fetch template -> update template -> receive event -> fetch template. As soon as we stopped one of the audit deployments, the problem went away.

Maybe the original reporter was in a similar deployment situation?

mattburgess avatar Sep 23 '21 13:09 mattburgess

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 23 '22 03:07 stale[bot]
