cluster-api icon indicating copy to clipboard operation
cluster-api copied to clipboard

Add OpenAPI Validation across API types

Open killianmuldoon opened this issue 2 years ago • 18 comments

Currently validation for our APIs is done, for the most part, in our webhooks. OpenAPI validation could be added (possibly using kubebuilder tags) to make this validation part of our OpenAPI definition. Example validations are max / min, regex matching and enum enforcement - but there's a lot of possibilities enabled by kubebuilder tags.

We currently don't use this type of validation across our API types, but we do have some examples in MachineHealthCheck, MachinePool, ClusterResourceSet and our Bootstrap types.

Putting validation in our OpenAPI specs would move implementation of validation from our webhooks to the Kubernetes tooling (kubectl and/or kube-apiserver). We would still require webhooks for a number of validations which cannot be done with OpenAPI.

Impact From a UX perspective the change would be, in an ideal scenario, small. We wouldn't be able to customize error messages, but error messages would be more consistent. Users would be able to understand most of our validation by reading the OpenAPI spec instead of reading godoc comments (which may be out of sync with the actual implementation in the webhook).

In reality UX impact could be large and negative because of how defaulting currently works. In our current implementation webhook Defaulting happens before Validation. Users are able to leave fields blank, even though they are required in our validation, and have sane, valid defaults applied.

Currently we have the DefaultValidate test used across our API types which validates this process: https://github.com/kubernetes-sigs/cluster-api/blob/09d0824d9ca15760b53a6cfee66eb2131df46c4f/util/defaulting/defaulting.go#L37

This is used to test: Machines, MachineSets, MachineHealthChecks, MachineDeployments, ClusterResourceSets, KubeadmConfigs, KubeadmConfigTemplates, MachinePools, KubeadmControlPlane, KubeadmControlPlaneTemplate.

If we move validation to OpenAPI validation will happen before webhooks are called which would require changes in how we do defaulting and which fields we expect to be defined by users.

Defaulting could alternatively, in some cases, be done in the OpenAPI schema, but the process doesn't have a good UX right now. There's an issue open about improving this in Kubernetes at https://github.com/kubernetes/kubernetes/issues/108768

This can also cause rollouts in fields that previously weren't defaulted (https://github.com/kubernetes-sigs/cluster-api/pull/6095 h/t @sbueringer ) meaning that a large implementation and testing effort could be needed to ensure the changes aren't causing new rollouts.

If we end up moving only some validation and defaulting to our OpenAPI spec we end up with validation in multiple places. Which doesn't really improve the explicitness of our API but, in the worst case scenario, could be a large implementation effort and have a negative impact on UX.

Original discussion here: https://github.com/kubernetes-sigs/cluster-api/pull/6383#pullrequestreview-933775232

As the discussion is on an experimetal API we could try to implement OpenAPI validation on that API only to assess the impact and rolling it out to more parts of the overall CAPI API.

Should we try to move more validation to the OpenAPI spec? @fabriziopandini @JoelSpeed

/kind feature /area api

killianmuldoon avatar Apr 08 '22 10:04 killianmuldoon

I like the idea of being more explicit in the OpenAPI spec, but TBG I have some concerns about embracing kubebuilder tags for validation because

  • Our current test about defaulting, validation and conversion doesn't trigger open API validation, it this could expose us to errors to slip in as already happened in the past
  • We will end up managing two set of rules (the one implemented in the OpenAPI spec, the one implemented in webhooks)

Given the above reasons, I think we should not put this discussion in the critical path for #6383 and take some more time to figure out to preserve current test coverage while improving out OpenAPI spec

fabriziopandini avatar Apr 11 '22 10:04 fabriziopandini

Immutability is one important lack in current CRD OpenAPI validation. Also the fact that any conceptually required field that you want to default needs to be declared optional for client side validation to not complain e.g kubectl, is suboptimal and confusing.

enxebre avatar Apr 11 '22 11:04 enxebre

Also the fact that any conceptually required field that you want to default needs to be declared optional for client side validation to not complain e.g kubectl, is suboptimal and confusing.

A defaulted field is optional from a user perspective as they don't have to specify it, with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

vincepri avatar Apr 11 '22 20:04 vincepri

A defaulted field is optional from a user perspective as they don't have to specify it, with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

I just tested this on K8s 1.23 and I don't think that's correct. I added a required field with a default, left it out of the resource and got an error saying the field was required, if I make it optional, the default applies correctly. Seems the defaulting happens after the validation

JoelSpeed avatar Apr 12 '22 09:04 JoelSpeed

We would still require webhooks for a number of validations which cannot be done with OpenAPI.

Is it possible to enumerate a list of the type of validations we expect to carry over in webhooks, were we to make this change?

If we move validation to OpenAPI validation will happen before webhooks are called which would require changes in how we do defaulting and which fields we expect to be defined by users.

Can you elaborate on this point? Is there conditional defaulting (or other defaulting) happening that cannot be handled by OpenAPI? Keen to know where the gaps are in the use case for Cluster API

This can also cause rollouts in fields that previously weren't defaulted (https://github.com/kubernetes-sigs/cluster-api/pull/6095 h/t @sbueringer ) meaning that a large implementation and testing effort could be needed to ensure the changes aren't causing new rollouts.

Adding or changing the behaviour of a default for a field breaks the assumptions of clients writing and consuming this API, so it's a breaking change. As far as I understand, that doesn't change whether we have OpenAPI defaults/validation vs Webhook defaults/validation, does it?

JoelSpeed avatar Apr 12 '22 11:04 JoelSpeed

Is it possible to enumerate a list of the type of validations we expect to carry over in webhooks, were we to make this change?

I think so, but it's non-trivial IMO. We'd definitely want to do this well before we'd be updating the API.

Can you elaborate on this point? Is there conditional defaulting (or other defaulting) happening that cannot be handled by OpenAPI? Keen to know where the gaps are in the use case for Cluster API

One of the ones that comes to mind is where we copy a value during defaulting: e.g https://github.com/kubernetes-sigs/cluster-api/blob/c029dd2155692c79fc6109ecfa333324ace88765/api/v1beta1/machine_webhook.go#L53

There's also some places where we e.g. normalize versions: https://github.com/kubernetes-sigs/cluster-api/blob/e622cbb1d20f9e3f62bee0100f8a82d4b518f373/exp/api/v1beta1/machinepool_webhook.go#L70-L74

Would we be able to accomplish these with kubebuilder defaulting?

killianmuldoon avatar Apr 13 '22 16:04 killianmuldoon

Would we be able to accomplish these with kubebuilder defaulting?

Conditional stuff like this is out of scope of openapi defaulting

JoelSpeed avatar Apr 13 '22 16:04 JoelSpeed

with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

@vincepri That's actually not true afaik, structural schema CRDs don't publish defaults so client-side can't know what it would be defaulted on the server. This results in a surprising UX for fields that are both required and defaulted failing client side validation if they are not set explicitly.

enxebre avatar Apr 13 '22 17:04 enxebre

A defaulted field is optional from a user perspective as they don't have to specify it, with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

I just tested this on K8s 1.23 and I don't think that's correct. I added a required field with a default, left it out of the resource and got an error saying the field was required, if I make it optional, the default applies correctly. Seems the defaulting happens after the validation

Joel and further down Alberto are correct. kubectl only does OpenAPI validation but no defaulting. I opened an issue for this in k/k: https://github.com/kubernetes/kubernetes/issues/108768.

But in my opinion this is a mostly separate discussion. Independent of if we add more OpenAPI validation or not we will have optional and required fields like today and this will be an issue until it is resolved in k/k. Of course by adding more OpenAPI validation we could theoretically hit this case more often, but I think only in rare edge cases (where the defaulting webhook "fixes up" field values, e.g. adding the v prefix on versions). The most common case of this issue is optional vs required and for those cases we already have OpenAPI validation today on required fields.

So for me it comes down to:

  • adding more OpenAPI validation to the API types to make restrictions/validations more obvious when looking at the CRD / OpenAPI schema vs.
  • keeping it as is and basically doing almost all validations in validation webhooks (except some basic validations we get out of the box like optional/required)

From a maintainer perspective I definitely prefer the validation in code as:

  • it's just a lot more straightforward to unit test (I'm aware we could unit test OpenAPI validation via envtest, but it is slower)
  • it keeps the entire validation logic in one place and thus the entire validation logic is easier to comprehend (while of course the simple OpenAPI validations we could add are less obvious as folks have to look at the code)

From a user perspective it might be better to surface the validations we could express via OpenAPI schema in the OpenAPI schema of the CRDs directly, although this doesn't provide a full picture so it might be misleading. I wonder if having a good description would be better.

Is it possible to enumerate a list of the type of validations we expect to carry over in webhooks, were we to make this change?

A few examples of validations we cannot move to OpenAPI:

  • KCP validates that certain fields are immutable
  • ClusterClass does a lot of advanced defaulting/validation. tl;dr is you can define OpenAPI schemas for variables in a ClusterClass and they are then defaulted and validated in the Cluster webhook. Links:
    • https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/write-clusterclass.html#clusterclass-with-patches
    • https://github.com/kubernetes-sigs/cluster-api/blob/7726885336063ce3e74df26e4e6f114643f0e5e9/internal/webhooks/clusterclass.go
  • Validations which check if the combination of values for multiple fields are correct: https://github.com/kubernetes-sigs/cluster-api/blob/ffa4348abc6ebc87924219bafffce9fa30981bc5/bootstrap/kubeadm/api/v1beta1/kubeadmconfig_webhook.go#L123-L154
  • Similar case: validations which ensure that the namespace in ownerRefs is the same as in the top-level resource
  • Validation which checks if a label selector is valid

I'm aware that there is some ongoing work on an expression language for validation in CRDs which is really nice, but this would only work on newer Kubernetes versions which we cannot assume for a very long time (link: https://github.com/kubernetes/enhancements/issues/2876)

I think there are a lot more cases and looking through our webhook code I'm not sure we could replace a significant percentage through OpenAPI validation.

sbueringer avatar Apr 26 '22 13:04 sbueringer

@sbueringer thanks for taking the time to think about this, that's a very thorough write up

I just wanted to add some extra context for completeness to your list of things we cannot move.

  • Immutability is on the backlog for CRDs https://github.com/kubernetes/kubernetes/pull/83743. I've spoken to a few of the API SMEs about this and the upshot was to wait for the SSA stuff to stabilise to avoid too many changes at once, but, in theory this feature is coming at some point

  • Validations which check if the combination of values for multiple fields are correct

    I believe this one can be solved with the CEL when that eventually lands right? (I know you've mentioned that this is a long time away but I wanted to add that context assuming I'm correct)

  • Similar case: validations which ensure that the namespace in ownerRefs is the same as in the top-level resource

    I don't think this is correct, there is no namespace field in an owner reference, that was a deliberate API design decision when the feature was introduced

JoelSpeed avatar Apr 27 '22 12:04 JoelSpeed

I believe this one can be solved with the CEL when that eventually lands right? (I know you've mentioned that this is a long time away but I wanted to add that context assuming I'm correct)

I think you're correct, a lot of of those should be implementable with CEL, not sure if we have something which is too complex for that.

I don't think this is correct, there is no namespace field in an owner reference, that was a deliberate API design decision when the feature was introduced

My bad, I meant ObjectReference instead of OwnerRefs (e.g. https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/clusterclass_types.go#L445-L449 but also at least in MD, MS, Cluster). This might be solved by stop using the not-recommended ObjectReference upstream type and using our own one with only the useful fields instead. In cases were namespace must be always the same as the one of the top-level API type it doesn't make sense to have it configurable. (xref: https://github.com/kubernetes-sigs/cluster-api/pull/6024)

sbueringer avatar Apr 27 '22 14:04 sbueringer

This might be solved by stop using the not-recommended ObjectReference upstream type and using our own one with only the useful fields instead.

Ahh this makes a lot more sense now 😅 We should definitely do that in the next API version

JoelSpeed avatar Apr 27 '22 14:04 JoelSpeed

I stand corrected :) Thanks folks for digging through these and getting all the information together!

vincepri avatar Apr 27 '22 15:04 vincepri

We should take a look at how/if ServerSideValidation changes the situation: https://github.com/kubernetes/kubernetes/pull/108889

sbueringer avatar May 30 '22 17:05 sbueringer

Hey there, wg-api-expression lead here, we're generally interested in building more expressive APIs,

Just read this thread, a few comments:

  • The whole discussion about required/defaulted fields. The default field in openapi is "informational", it tells you what value the server will use if no value is specified, that sounds entirely incompatible with a required field. Also it's not an indication that clients should default themselves to that value. It's easy to fix though, just don't make the field required if it has a default (we could forbid that in the apiserver).
  • Immutability, this will be doable entirely with CEL, we're writing a blog post to document the different use-cases. I agree it's not doable until CEL is enabled.
  • CEL for various things that are too complicated for OpenAPI. Again, not ready yet :-/

We're generally trying to make sure people don't have to maintain a validating webhook, so I'd love to hear the use-cases that you can't get rid of.

apelisse avatar Aug 11 '22 16:08 apelisse

We're generally trying to make sure people don't have to maintain a validating webhook, so I'd love to hear the use-cases that you can't get rid of.

The trickiest cases we have rely on the existence of objects, or of fields in other objects. When we create a cluster with ClusterClass field specified we run a get from the API server to ensure the ClusterClass exists, that it has some required fields and that variable definitions in the Cluster meet some of the contstraints in the ClusterClass.

Seeing as it relies on the spec of another object it seems like this would be hard to do without a webhook or something very expressive.

killianmuldoon avatar Aug 15 '22 11:08 killianmuldoon

Also commented on the linked k/k issue, but just to mention it here. I think this statement is not correct The default field in openapi is "informational". The APIserver first runs defaulting (based on the default fields) then validation on the server-side while kubectl only runs validation.

Thus we get inconsistent results on client- and server-side

sbueringer avatar Aug 15 '22 16:08 sbueringer

Agreed, but I think it's wrong that fields are both required and defaulted, we can't fix that in the api now, but we should strongly recommend not having these.

apelisse avatar Aug 15 '22 17:08 apelisse

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 13 '22 17:11 k8s-triage-robot

/lifecycle frozen

fabriziopandini avatar Nov 14 '22 08:11 fabriziopandini

/triage accepted

fabriziopandini avatar Nov 30 '22 20:11 fabriziopandini

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Jan 19 '24 12:01 k8s-triage-robot

/priority important-longterm

fabriziopandini avatar Apr 12 '24 14:04 fabriziopandini

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

@sbueringer summarized our options in https://github.com/kubernetes-sigs/cluster-api/issues/7634

  1. All validation should be done in webhooks (with only very few exceptions like: optional/required, maybe enum)
  2. Validation should be done as much as possible in OpenAPI, everything else in webhooks

I think that given the current status of the project/number of maintainers, and the fact that no one is raising concerns about the expressivity of our OpenAPI schema, only 1 is viable because it is basically the current state and we can keep improving stuff incrementally in a surgical way.

I there won't be comments, I will close this issue and proceed as described above

fabriziopandini avatar Apr 24 '24 13:04 fabriziopandini