controller-runtime icon indicating copy to clipboard operation
controller-runtime copied to clipboard

🐛 Fix custom defaulter: avoid deleting unknown fields (zero change patch)

Open trasc opened this issue 1 year ago • 9 comments

When creating a patch for the custom defaulter, don't return the removals generated by object decoding with the local scheme.

Otherwise, the patch will include a remove instruction for fields that the go type doesn't contain.

This could happen when building a webhook for a resource that belongs to another project (including k8s types).

This PR add a new option DefaulterPreserveUnknownFields that can be used with WithCustomDefaulter to stop the defaulter from pruning the fields that are not recognized in the local scheme.

When doing this we are creating two patches one before and one after the custom defaulter call and only send back the remove operations that are not present in both patches.

(This is a rework of #2931)

trasc avatar Oct 14 '24 07:10 trasc

/cc @alculquicondor /cc @sbueringer /cc @alvaroaleman

trasc avatar Oct 14 '24 08:10 trasc

For context to c-r maintainers, we have been discussing what alternatives we have for a generic defaulter that doesn't drop unknown fields. Here is the thread for additional context https://github.com/kubernetes-sigs/kueue/pull/3194#issuecomment-2405174799

Taking @mimowo's words, we would be happy to first merge a solution in kueue and let it mature there before moving it here. But we would appreciate your input to the approaches.

alculquicondor avatar Oct 15 '24 15:10 alculquicondor

@alculquicondor @trasc @mimowo We (Stefan, Vince and I) had some discussion around https://github.com/kubernetes-sigs/controller-runtime/pull/2932 vs this and we landed at this here being better but we aren't 100% sure if its compatible in all cases and currently don't have time to spend on that question, which is why we would like an opt-out for what is suggested here (but on by default).

Does that sound reasonable to you all?

alvaroaleman avatar Oct 23 '24 16:10 alvaroaleman

@alculquicondor @trasc @mimowo We (Stefan, Vince and I) had some discussion around #2932 vs this and we landed at this here being better but we aren't 100% sure if its compatible in all cases and currently don't have time to spend on that question, which is why we would like an opt-out for what is suggested here (but on by default).

Does that sound reasonable to you all?

Sure, I'll ping you when I have something in that direction.

trasc avatar Oct 23 '24 16:10 trasc

Thanks for considering the options!

Do you have any particular guideline on how the opt-out mechanism could look like? Should we have a different method in the factory? Or should the method WithDefaulter accept some options? Any other suggestion?

Other than that, I hope the overhead of the approach is not significant. Fortunately, a new encoding is coming to k/k CRDs https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4222-cbor-serializer#motivation

alculquicondor avatar Oct 23 '24 19:10 alculquicondor

I think

... should the method WithDefaulter accept some options ...

it leave room to maybe add an additional option that removes the add zero values.

trasc avatar Oct 24 '24 08:10 trasc

@alvaroaleman please have a look at my DefaulterPreserveUnknownFields proposal when you have some time.

Also please let me know if we can do something about the pull-controller-runtime-apidiff check,

changed from 
func(*k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/runtime.Object, CustomDefaulter) *Webhook to 
func(*k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/runtime.Object, CustomDefaulter, ...defaulterOption) *Webhook

is a flase positive in my opinion, as you don't need to actually do something to old code in order to use the new version.

trasc avatar Oct 24 '24 09:10 trasc

/lgtm

alculquicondor avatar Oct 25 '24 18:10 alculquicondor

LGTM label has been added.

Git tree hash: dc8f633f86d5467de1824324a9004cf5c10d3d61

k8s-ci-robot avatar Oct 25 '24 18:10 k8s-ci-robot

LGTM label has been added.

Git tree hash: ad6c906a29abd3648f1ceeefac007955a4d3b86a

k8s-ci-robot avatar Oct 29 '24 02:10 k8s-ci-robot

I'll take a look soon, just low bandwidth at the moment

sbueringer avatar Oct 31 '24 18:10 sbueringer

In general okay. One comment about the option name and documentation

sbueringer avatar Nov 01 '24 14:11 sbueringer

Thank you!!

/lgtm

@alvaroaleman Keeping the hold in case you want to take a final look after the last change

sbueringer avatar Nov 04 '24 12:11 sbueringer

LGTM label has been added.

Git tree hash: f8f793a6960067cffea8b08e87822f9d14246b85

k8s-ci-robot avatar Nov 04 '24 12:11 k8s-ci-robot

PR needs rebase.

LGTM modulo the rebase

alvaroaleman avatar Nov 05 '24 19:11 alvaroaleman

Rebased

trasc avatar Nov 05 '24 20:11 trasc

@trasc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-controller-runtime-apidiff fb7b6cd415ce2bc7cf004f4d9246ec2696f88465 link false /test pull-controller-runtime-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot avatar Nov 05 '24 20:11 k8s-ci-robot

Thx!

/lgtm /approve /hold cancel

sbueringer avatar Nov 05 '24 20:11 sbueringer

LGTM label has been added.

Git tree hash: 7302a762ff601db57984c8fa33072a6e68119c8f

k8s-ci-robot avatar Nov 05 '24 20:11 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, sbueringer, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [alvaroaleman,sbueringer]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Nov 05 '24 20:11 k8s-ci-robot

@trasc @alculquicondor Hey folks,

just an early heads up already. I bumped Cluster API to CR main branch and I'm now getting panics from the webhook: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api/11633/pull-cluster-api-e2e-blocking-main/1874860029216034816/artifacts/clusters/bootstrap/logs/capd-system/capd-controller-manager/capd-controller-manager-5b59b9b76-4kmz2/manager.log

Looks like jsonpatch.Operation is not always hashable

E0102 17:01:30.424172       1 <autogenerated>:1] "Observed a panic" logger="admission" webhookGroup="infrastructure.cluster.x-k8s.io" webhookKind="DockerClusterTemplate" DockerClusterTemplate="quick-start-zwk5pp/quick-start-cluster" namespace="quick-start-zwk5pp" name="quick-start-cluster" resource={"group":"infrastructure.cluster.x-k8s.io","version":"v1beta1","resource":"dockerclustertemplates"} user="kubernetes-admin" requestID="39810ae0-c339-4d68-a8cf-c007f0215a5b" panic="runtime error: hash of unhashable type map[string]interface {}" panicGoValue="\"hash of unhashable type map[string]interface {}\"" stacktrace=<
	goroutine 515 [running]:
	k8s.io/apimachinery/pkg/util/runtime.logPanic({0x259abe0, 0xc000822690}, {0x1ee0d80, 0xc00086cb10})
		k8s.io/[email protected]/pkg/util/runtime/runtime.go:107 +0xbc
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle.func1()
		sigs.k8s.io/[email protected]/pkg/webhook/admission/webhook.go:163 +0x103
	panic({0x1ee0d80?, 0xc00086cb10?})
		runtime/panic.go:785 +0x132
	type:.hash.gomodules.xyz/jsonpatch/v2.Operation(0xc0007ea7c8, 0x5aa77e?)
		<autogenerated>:1 +0x45
	k8s.io/apimachinery/pkg/util/sets.Set[...].Has(...)
		k8s.io/[email protected]/pkg/util/sets/set.go:78
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).dropSchemeRemovals.func3({{0x22158db, 0x3}, {0xc0003d34d0, 0x28}, {0x1ebeba0, 0xc000aa9500}})
		sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:158 +0x65
	slices.IndexFunc[...](...)
		slices/slices.go:109
	slices.DeleteFunc[...]({0xc000740a80?, 0x8, 0x8}, 0xc0007eab68)
		slices/slices.go:237 +0xe3
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).dropSchemeRemovals(_, {{0xc000740a80, 0x8, 0x8}, {{0x0, 0x0}, 0x1, 0x0, {0x0, 0x0, ...}, ...}}, ...)
		sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:157 +0x4ac
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).Handle(_, {_, _}, {{{0xc0003d2ab0, 0x24}, {{0xc000e0bc20, 0x1f}, {0xc000a75510, 0x7}, {0xc000c10780, ...}}, ...}})
		sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:129 +0x42e

sbueringer avatar Jan 02 '25 17:01 sbueringer

Not sure what the most efficient way is to compare jsonpatch.Operations (the straightforward option would be to call the Json method and compare that instead

sbueringer avatar Jan 02 '25 17:01 sbueringer

I don't think we need to compare the whole thing, just the combination of operation and path. I would be surprised if jsonpatch.CreatePatch generates more than one entry for a path, even.

Here is the type, for reference https://github.com/gomodules/jsonpatch/blob/e2d6410c0299d49b797482dc765a3b49ecc8a4b6/v3/jsonpatch.go#L17-L21

Should we revert for now?

cc @mbobrovskyi @mimowo

alculquicondor avatar Jan 02 '25 18:01 alculquicondor

IIUC, the JSON Patch remove operation shouldn't include a value. I'm curious about the specific request needed to reproduce this scenario.

mbobrovskyi avatar Jan 03 '25 05:01 mbobrovskyi

I think I've identified the problem. We're comparing all patches, which makes it possible to encounter a non-hashable value.

Created a fix https://github.com/kubernetes-sigs/controller-runtime/pull/3056.

mbobrovskyi avatar Jan 03 '25 07:01 mbobrovskyi

Should we revert for now?

I think we're fine with the fix (#3056 should cover it)

I think I've identified the problem. We're comparing all patches, which makes it possible to encounter a non-hashable value.

Looks perfect, thank you!

sbueringer avatar Jan 03 '25 11:01 sbueringer

Quick feedback. The panic is fixed 🎉

I'm just hitting another issue in Cluster API. Not sure yet if that is something that we should care about in CR

API field:

	// imageTag allows to specify a tag for the image.
	// +optional
	ImageTag string `json:"imageTag,omitempty"`

Through a combination of unexpected behavior, we're trying to set this field to null via one of our CAPI controllers:

  "local" : {
    "imageTag" : null
  }

Before this change: our defaulting/mutating webhook generated this patch operation:

Operation: remove
Path: /spec/kubeadmConfigSpec/clusterConfiguration/dns/imageTag

=> The result was that the field was removed before CR validation

After this change: this patch operation is dropped by dropSchemeRemovals => The result is that imageTag stays null => Because the field is not nullable we end up with

failed to create KubeadmControlPlane.controlplane.cluster.x-k8s.io:
        FieldValueTypeInvalid: spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag:
        Invalid value: "null": spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag in body must be of type string

sbueringer avatar Jan 03 '25 12:01 sbueringer

It looks like WAI, from the name of the defaulter option removeUnknownOrOmitableFields.

It's a bug in the CAPI controller that happened to be masked by the webhook implementation details. I wonder why the CAPI controller sets null, tho. Doesn't it use the go types that are marked with omitempty?

alculquicondor avatar Jan 03 '25 14:01 alculquicondor

I agree. The behavior of the defaulter is exactly as expected.

CAPI has a controller that has to work with unstructured and in some cases it sets the field (unintentionally) to nil instead of an empty string.

I'm only wondering how many folks we break that are in a similar situation as CAPI. CAPI itself is not a problem, I can deal with this there by setting the option. I think I have to set it because we can't make a change like this within an API version in CAPI (there are many tools/products built on top of CAPI and they shouldn't suddenly start failing with issues like this). If I don't set the option the behavior of our API changes. Values that have been (unintentionally) valid before are suddenly invalid.

sbueringer avatar Jan 04 '25 08:01 sbueringer

We did make the change opt-out. I think it's reasonable to make it opt-in for a few releases, have a deprecation note and eventually remove it.

alculquicondor avatar Jan 06 '25 14:01 alculquicondor