Discussion: crd.projectcalico.org/v1 vs projectcalico.org/v3
This issue comes up frequently enough that I think it warrants its own parent issue to explain and discuss. I'll try to keep this up-to-date with the latest thinking and status should it change.
The problem generally manifests itself as one of the following:
1. `no matches for kind "X" in version "projectcalico.org/v3"` when attempting to apply a resource.
2. Applying a resource with `apiVersion: crd.projectcalico.org/v1` and Calico not behaving as expected.
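As a concrete illustration of the second symptom, a manifest like the following will apply without complaint because it writes straight to the backing CRD, but it skips the validation and defaulting described below (the pool values are illustrative only):

```yaml
# Illustrative only - don't do this. Writing to crd.projectcalico.org/v1
# bypasses the Calico API server's validation and defaulting entirely.
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: example-pool
spec:
  cidr: 192.168.0.0/16
```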
TL;DR
Don't touch crd.projectcalico.org/v1 resources. They are not currently supported for end-users and the entire API group is only used internally within Calico. Using any API within that group means you will bypass API validation and defaulting, which is bad and can result in symptoms like #2 above. You should use projectcalico.org/v3 instead. Note that projectcalico.org/v3 requires that you install the Calico API server in your cluster, and will result in errors similar to #1 above if the Calico API server is not running.
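For comparison, the supported form of the same (illustrative) pool goes through the Calico API server:

```yaml
# Supported path: served by the Calico API server, which validates and
# defaults the resource before it is stored. If the API server is not
# installed, applying this fails with the "no matches for kind" error (#1).
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: example-pool
spec:
  cidr: 192.168.0.0/16
```

If you're not sure whether the API server is installed, `kubectl api-resources --api-group=projectcalico.org` listing resources (rather than erroring) is one quick way to check.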
Ok, but why do it that way?
Well, it's partly because of limitations in CRDs, and partly due to historical reasons. CRDs provide some validation out of the box, but they can't express some of the more complex cross-field and cross-object API validation that the Calico API server performs. For example, making sure that IP pools are consistent with the IPAM block resources within the cluster is a complex validation process that just can't be expressed in an OpenAPI schema. The same goes for some of the defaulting operations (e.g., conditional defaulting based on other fields).
As a result, Calico uses an aggregation API server to perform these complex tasks against projectcalico.org/v3 APIs, and stores the resulting validated and defaulted resources in the "backend" as CRDs within the crd.projectcalico.org/v1 group. Prior to the introduction of said API server, all of that validation and defaulting had to be performed client-side via calicoctl, but data was still stored in the "backend" as CRDs, for Calico itself to consume.
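Mechanically, that aggregation is wired up via an APIService object, which tells the Kubernetes API server to delegate the whole projectcalico.org/v3 group to the Calico API server's Service. Roughly like this (the service name, namespace, and priorities reflect a typical operator-based install and may differ in yours):

```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v3.projectcalico.org
spec:
  group: projectcalico.org
  version: v3
  groupPriorityMinimum: 1500
  versionPriority: 200
  # caBundle for the Calico API server's serving cert is set at install time.
  service:
    # Requests for projectcalico.org/v3 are proxied to this Service,
    # i.e. to the Calico API server pods behind it.
    name: calico-api
    namespace: calico-apiserver
    port: 443
```

The validated, defaulted result then lands in the corresponding crd.projectcalico.org/v1 object, which is why, for example, `kubectl get ippools.projectcalico.org` and `kubectl get ippools.crd.projectcalico.org` can show the "same" pool via two different code paths.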
CRD validation has come a long way since Calico first started using CRDs, back when they were still in beta and were called ThirdPartyResources. However, they still don't (and probably won't ever) support the types of validation that Calico currently enforces via its API server.
Pain points
Yes, this model is not perfect and has a few known (non-trivial) pain points that I would love to resolve.
- Having two APIs that look very similar (but in fact have a few small, key differences) is very confusing for users.
- Relying on an aggregation layer to serve the APIs introduces a more complex ordering dependency, since the Calico API server requires Calico in order to run, meaning projectcalico.org/v3 isn't usable until the API server image has been pulled and the pod is running on the cluster. This can cause issues for GitOps tools like Flux, which don't have a great way of understanding that multi-layer dependency.
- Aggregate API servers are not commonly used in other projects, whereas CRDs are, causing additional confusion.
Can we make it better?
Maybe. I hope so! But the solutions are not simple. We'd need to do at least some combination of the following, based on my current best guesses.
- Add validation and defaulting to `crd.projectcalico.org/v1` and make it supported. We can't do this without introducing a webhook, which is not really desirable. We can do maybe 25% of our validation via CRD server-side validation (see the sketch after this list), but we'd be losing a lot of our current validation and defaulting if we go this route.
- Allow the API server to run with host networking, prior to installing Calico. This solves some problems, but not all of them (we still have two APIs that are confusing, and users still need to install a pod on the cluster to use our API).
- Remove `crd.projectcalico.org/v1` altogether, and instead back the aggregation layer with Kubernetes' etcd instance. This solves the "two APIs" problem, but would be a rather cumbersome data migration project and doesn't remove the need for a Calico-specific API server.
- Introduce a new API group (or groups) - e.g., `policy.projectcalico.org/v3` - and write all new CRD-based APIs within it, with a focus on making the syntax and semantics 100% compatible with what CRD validation and defaulting provides.
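To make that "25%" a bit more concrete, here's a hedged sketch of the kind of per-object rule CRD server-side validation can express today via CEL validation rules (`x-kubernetes-validations`). The schema below is heavily trimmed for illustration and is not the real IPPool CRD; crucially, cross-object checks (e.g. keeping pools consistent with existing IPAM blocks) still can't be written this way:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: ippools.crd.projectcalico.org
spec:
  group: crd.projectcalico.org
  scope: Cluster
  names:
    kind: IPPool
    listKind: IPPoolList
    plural: ippools
    singular: ippool
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cidr:
                  type: string
                blockSize:
                  type: integer
              # A single-object invariant like this is expressible in CEL.
              # The real rule would also depend on the pool's IP family,
              # which is already awkward to encode here.
              x-kubernetes-validations:
                - rule: "!has(self.blockSize) || (self.blockSize >= 20 && self.blockSize <= 32)"
                  message: "blockSize must be between 20 and 32"
```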
Cross-referencing an older, tangential discussion: https://github.com/projectcalico/calico/issues/2923
> We can't do this without introducing a webhook, which is not really desirable
Why is a webhook less desirable than an apiserver? Putting the validation logic in a webhook would remove the need for the apiservice (assuming defaulting could be done in the CRD).
> Why is a webhook less desirable than an apiserver?
It's not that it's less desirable, per se; it's mostly that it suffers from many of the same problems as a separate apiserver does - i.e., running another pod on the cluster that needs its own networking, etc., in order to provide defaulting and validation, rather than performing that natively within the Kubernetes API server.
The nice thing about a validating webhook is that it has a built-in toggle for skirting around it when one is in the early stages of installing Calico (assuming the webhook ran in the pod network). However, my bet is that a webhook with `hostNetwork` would be pretty reasonable (it shouldn't ever need to run with `failurePolicy=Ignore`, and should have very few downsides compared to the other routes considered, with the big benefit of having one API group). It also provides a fairly straight path to the last option:
> making the syntax and semantics 100% compatible with what CRD validation and defaulting provides
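For reference, the webhook route being discussed would look something along these lines - all names, namespaces, and paths below are made up for illustration; this is not an existing Calico component:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: calico-crd-validation        # hypothetical
webhooks:
  - name: validate.crd.projectcalico.org
    # With the webhook pod running on hostNetwork (as suggested above),
    # failurePolicy could stay at Fail; flipping it to Ignore is the
    # "built-in toggle" for skirting around the webhook during install.
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: calico-validation      # hypothetical Service
        namespace: calico-system
        path: /validate
        port: 443
    rules:
      - apiGroups: ["crd.projectcalico.org"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
```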
We'd like to expand on an already-mentioned pain point: the requirement for Calico / networking to work before the k8s API server can serve projectcalico.org/v3.
Because the aggregated API is registered using a k8s Service IP, this essentially also requires that routing for the k8s service CIDR works properly on the k8s controller nodes. This is particularly difficult if you are running isolated k8s controller nodes and rely on BGP & Calico to announce the k8s service CIDR routes to the controller nodes.
During our experiments we ended up in a situation where no k8s service routes were present anymore, and the k8s controller node / API server component was not able to properly set up projectcalico.org/v3 because the Calico API server's Service IP was not reachable; the k8s API server then failed because of that.