🐛 Fix custom defaulter: avoid deleting unknown fields (zero change patch)
When creating a patch for the custom defaulter, don't return the removals generated by object decoding with the local scheme.
Otherwise, the patch will include a remove instruction for fields that the go type doesn't contain.
This could happen when building a webhook for a resource that belongs to another project (including k8s types).
This PR add a new option DefaulterPreserveUnknownFields that can be used with WithCustomDefaulter to stop the defaulter from pruning the fields that are not recognized in the local scheme.
When doing this we are creating two patches one before and one after the custom defaulter call and only send back the remove operations that are not present in both patches.
(This is a rework of #2931)
/cc @alculquicondor /cc @sbueringer /cc @alvaroaleman
For context to c-r maintainers, we have been discussing what alternatives we have for a generic defaulter that doesn't drop unknown fields. Here is the thread for additional context https://github.com/kubernetes-sigs/kueue/pull/3194#issuecomment-2405174799
Taking @mimowo's words, we would be happy to first merge a solution in kueue and let it mature there before moving it here. But we would appreciate your input to the approaches.
@alculquicondor @trasc @mimowo We (Stefan, Vince and I) had some discussion around https://github.com/kubernetes-sigs/controller-runtime/pull/2932 vs this and we landed at this here being better but we aren't 100% sure if its compatible in all cases and currently don't have time to spend on that question, which is why we would like an opt-out for what is suggested here (but on by default).
Does that sound reasonable to you all?
@alculquicondor @trasc @mimowo We (Stefan, Vince and I) had some discussion around #2932 vs this and we landed at this here being better but we aren't 100% sure if its compatible in all cases and currently don't have time to spend on that question, which is why we would like an opt-out for what is suggested here (but on by default).
Does that sound reasonable to you all?
Sure, I'll ping you when I have something in that direction.
Thanks for considering the options!
Do you have any particular guideline on how the opt-out mechanism could look like? Should we have a different method in the factory? Or should the method WithDefaulter accept some options? Any other suggestion?
Other than that, I hope the overhead of the approach is not significant. Fortunately, a new encoding is coming to k/k CRDs https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4222-cbor-serializer#motivation
I think
... should the method
WithDefaulteraccept some options ...
it leave room to maybe add an additional option that removes the add zero values.
@alvaroaleman please have a look at my DefaulterPreserveUnknownFields proposal when you have some time.
Also please let me know if we can do something about the pull-controller-runtime-apidiff check,
changed from
func(*k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/runtime.Object, CustomDefaulter) *Webhook to
func(*k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/runtime.Object, CustomDefaulter, ...defaulterOption) *Webhook
is a flase positive in my opinion, as you don't need to actually do something to old code in order to use the new version.
/lgtm
LGTM label has been added.
LGTM label has been added.
I'll take a look soon, just low bandwidth at the moment
In general okay. One comment about the option name and documentation
Thank you!!
/lgtm
@alvaroaleman Keeping the hold in case you want to take a final look after the last change
LGTM label has been added.
PR needs rebase.
LGTM modulo the rebase
Rebased
@trasc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-controller-runtime-apidiff | fb7b6cd415ce2bc7cf004f4d9246ec2696f88465 | link | false | /test pull-controller-runtime-apidiff |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Thx!
/lgtm /approve /hold cancel
LGTM label has been added.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alvaroaleman, sbueringer, trasc
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [alvaroaleman,sbueringer]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@trasc @alculquicondor Hey folks,
just an early heads up already. I bumped Cluster API to CR main branch and I'm now getting panics from the webhook: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api/11633/pull-cluster-api-e2e-blocking-main/1874860029216034816/artifacts/clusters/bootstrap/logs/capd-system/capd-controller-manager/capd-controller-manager-5b59b9b76-4kmz2/manager.log
Looks like jsonpatch.Operation is not always hashable
E0102 17:01:30.424172 1 <autogenerated>:1] "Observed a panic" logger="admission" webhookGroup="infrastructure.cluster.x-k8s.io" webhookKind="DockerClusterTemplate" DockerClusterTemplate="quick-start-zwk5pp/quick-start-cluster" namespace="quick-start-zwk5pp" name="quick-start-cluster" resource={"group":"infrastructure.cluster.x-k8s.io","version":"v1beta1","resource":"dockerclustertemplates"} user="kubernetes-admin" requestID="39810ae0-c339-4d68-a8cf-c007f0215a5b" panic="runtime error: hash of unhashable type map[string]interface {}" panicGoValue="\"hash of unhashable type map[string]interface {}\"" stacktrace=<
goroutine 515 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x259abe0, 0xc000822690}, {0x1ee0d80, 0xc00086cb10})
k8s.io/[email protected]/pkg/util/runtime/runtime.go:107 +0xbc
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle.func1()
sigs.k8s.io/[email protected]/pkg/webhook/admission/webhook.go:163 +0x103
panic({0x1ee0d80?, 0xc00086cb10?})
runtime/panic.go:785 +0x132
type:.hash.gomodules.xyz/jsonpatch/v2.Operation(0xc0007ea7c8, 0x5aa77e?)
<autogenerated>:1 +0x45
k8s.io/apimachinery/pkg/util/sets.Set[...].Has(...)
k8s.io/[email protected]/pkg/util/sets/set.go:78
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).dropSchemeRemovals.func3({{0x22158db, 0x3}, {0xc0003d34d0, 0x28}, {0x1ebeba0, 0xc000aa9500}})
sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:158 +0x65
slices.IndexFunc[...](...)
slices/slices.go:109
slices.DeleteFunc[...]({0xc000740a80?, 0x8, 0x8}, 0xc0007eab68)
slices/slices.go:237 +0xe3
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).dropSchemeRemovals(_, {{0xc000740a80, 0x8, 0x8}, {{0x0, 0x0}, 0x1, 0x0, {0x0, 0x0, ...}, ...}}, ...)
sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:157 +0x4ac
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).Handle(_, {_, _}, {{{0xc0003d2ab0, 0x24}, {{0xc000e0bc20, 0x1f}, {0xc000a75510, 0x7}, {0xc000c10780, ...}}, ...}})
sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:129 +0x42e
Not sure what the most efficient way is to compare jsonpatch.Operations (the straightforward option would be to call the Json method and compare that instead
I don't think we need to compare the whole thing, just the combination of operation and path.
I would be surprised if jsonpatch.CreatePatch generates more than one entry for a path, even.
Here is the type, for reference https://github.com/gomodules/jsonpatch/blob/e2d6410c0299d49b797482dc765a3b49ecc8a4b6/v3/jsonpatch.go#L17-L21
Should we revert for now?
cc @mbobrovskyi @mimowo
IIUC, the JSON Patch remove operation shouldn't include a value. I'm curious about the specific request needed to reproduce this scenario.
I think I've identified the problem. We're comparing all patches, which makes it possible to encounter a non-hashable value.
Created a fix https://github.com/kubernetes-sigs/controller-runtime/pull/3056.
Should we revert for now?
I think we're fine with the fix (#3056 should cover it)
I think I've identified the problem. We're comparing all patches, which makes it possible to encounter a non-hashable value.
Looks perfect, thank you!
Quick feedback. The panic is fixed 🎉
I'm just hitting another issue in Cluster API. Not sure yet if that is something that we should care about in CR
API field:
// imageTag allows to specify a tag for the image.
// +optional
ImageTag string `json:"imageTag,omitempty"`
Through a combination of unexpected behavior, we're trying to set this field to null via one of our CAPI controllers:
"local" : {
"imageTag" : null
}
Before this change: our defaulting/mutating webhook generated this patch operation:
Operation: remove
Path: /spec/kubeadmConfigSpec/clusterConfiguration/dns/imageTag
=> The result was that the field was removed before CR validation
After this change: this patch operation is dropped by dropSchemeRemovals => The result is that imageTag stays null => Because the field is not nullable we end up with
failed to create KubeadmControlPlane.controlplane.cluster.x-k8s.io:
FieldValueTypeInvalid: spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag:
Invalid value: "null": spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag in body must be of type string
It looks like WAI, from the name of the defaulter option removeUnknownOrOmitableFields.
It's a bug in the CAPI controller that happened to be masked by the webhook implementation details.
I wonder why the CAPI controller sets null, tho. Doesn't it use the go types that are marked with omitempty?
I agree. The behavior of the defaulter is exactly as expected.
CAPI has a controller that has to work with unstructured and in some cases it sets the field (unintentionally) to nil instead of an empty string.
I'm only wondering how many folks we break that are in a similar situation as CAPI. CAPI itself is not a problem, I can deal with this there by setting the option. I think I have to set it because we can't make a change like this within an API version in CAPI (there are many tools/products built on top of CAPI and they shouldn't suddenly start failing with issues like this). If I don't set the option the behavior of our API changes. Values that have been (unintentionally) valid before are suddenly invalid.
We did make the change opt-out. I think it's reasonable to make it opt-in for a few releases, have a deprecation note and eventually remove it.