
[Feedback] Calico state that api version v1 is not recommended and that it's unsupported, but AKS does not install v3

Open MarkTopping opened this issue 1 year ago • 10 comments

Describe your scenario

The AKS installation of Calico Network Policy forces customers to use a reportedly incorrect and unsupported version of Calico resources. AKS customers can only deploy v1 resources, whereas Tigera/Calico state that consumers should be using v3.

If users look up how to create NetworkPolicies or other Calico resources in the Calico documentation, they will find that examples are provided with apiVersion: projectcalico.org/v3. If, however, we attempt to deploy such a resource into an AKS cluster, we get the following error:

no matches for kind "xxx" in version "crd.projectcalico.org/v3" ensure CRDs are installed first

To overcome this, AKS users have to deploy Calico objects using apiVersion: crd.projectcalico.org/v1 instead.
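For illustration, the difference comes down to the API group in the manifest. The sketch below assumes a GlobalNetworkSet; the resource name, label, and CIDR are hypothetical placeholders rather than values from this issue, and only one of the two documents would be applied in practice.

```yaml
# Form shown in the Calico documentation. On AKS this is rejected unless
# the Calico API server is running in the cluster.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkSet
metadata:
  name: example-trusted-ranges      # hypothetical name
  labels:
    role: trusted-ranges            # hypothetical label
spec:
  nets:
    - 203.0.113.0/24                # documentation-only range (RFC 5737)
---
# Workaround form that AKS users end up applying: same body, but written
# directly against the CRD group that Calico treats as internal.
apiVersion: crd.projectcalico.org/v1
kind: GlobalNetworkSet
metadata:
  name: example-trusted-ranges      # hypothetical name
  labels:
    role: trusted-ranges            # hypothetical label
spec:
  nets:
    - 203.0.113.0/24
```

Nothing else in the manifest changes, which is why the workaround is tempting, but the second form writes straight to the CRD and so skips the validation and defaulting that the v3 API would otherwise apply.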

A while back, one of the ProjectCalico team members posted the following discussion on their GitHub: https://github.com/projectcalico/calico/issues/6412

Within this GitHub issue it states:

"Don't touch crd.projectcalico.org/v1 resources. They are not currently supported for end-users and the entire API group is only used internally within Calico. Using any API within that group means you will bypass API validation and defaulting, which is bad and can result in symptoms like # 2 above. You should use projectcalico.org/v3 instead"

In recent days I sought their clarification, since AKS does not make v3 available. The same message was conveyed: v3 should be used. The respondent was surprised that AKS does not deploy the Calico API Server, which would apparently install the v3 CRDs.

Feedback

It would be very much appreciated if Microsoft could liaise with Tigera/ProjectCalico to confirm their recommendations and what exactly is supported, and then ensure that AKS consumers who enable Calico Network Policy are not left in a position where they are unable to use the correct resource types.

Thank you

MarkTopping avatar Dec 18 '23 13:12 MarkTopping

situation remains the same. not stale

MarkTopping avatar Mar 31 '24 11:03 MarkTopping

Experiencing the same. A manual install of the Calico API server fails with: unable to load configmap based request-header-client-ca-file.

Enabled Calico as the network policy engine on an AKS cluster. Calico cluster version: 3.24.6

maur1 avatar Apr 24 '24 09:04 maur1

situation remains the same. not stale

MarkTopping avatar May 29 '24 09:05 MarkTopping

Still should not be closed. Perhaps an 'ignored' tag would be more appropriate :-(

MarkTopping avatar Jun 19 '24 11:06 MarkTopping

Not stale

aslafy-z avatar Jul 10 '24 20:07 aslafy-z

Not stale.

rvaccarim avatar Aug 02 '24 19:08 rvaccarim

Any updates on this? Thanks in advance.

rvaccarim avatar Aug 16 '24 11:08 rvaccarim

Any updates on this? Thanks in advance.

rvaccarim avatar Sep 02 '24 13:09 rvaccarim

@paulgmiller would you be able to take a look at this.

chasewilson avatar Sep 23 '24 16:09 chasewilson

any update?

joeybdub avatar Sep 24 '24 13:09 joeybdub

any idea when this feature will be added?

joeybdub avatar Oct 03 '24 08:10 joeybdub

Hi folks, my initial stance on this is that the AKS Calico addon is only supported by us (AKS) for Kubernetes network policies (the ones in networking.k8s.io/v1).

https://kubernetes.io/docs/concepts/services-networking/network-policies/

That is all we test against, so it's not clear that enabling the v3 APIs would actually lead anyone down a good path here. We definitely have a documentation gap here and ideally should have blocked projectcalico.org/v1 years ago, but that was before my time.
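As a point of reference, the resources inside that tested scope are plain Kubernetes NetworkPolicy objects. A minimal sketch, assuming a hypothetical `demo` namespace and hypothetical `app` labels and port:

```yaml
# Deny all ingress to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress        # hypothetical name
  namespace: demo                   # hypothetical namespace
spec:
  podSelector: {}                   # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
---
# Then allow one specific flow: frontend pods to api pods on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api       # hypothetical name
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: api                      # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend         # hypothetical label
      ports:
        - protocol: TCP
          port: 8080                # hypothetical port
```

These policies are namespaced and allow-only: there is no cluster-wide scope, no explicit deny action, and no ordering between policies.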

Would be interested in what different customers are using the calico apis for (I am guessing global policies) but I can't make any promises that we'll turn on v3 at this time.

paulgmiller avatar Oct 03 '24 16:10 paulgmiller

Well, it's not just GlobalNetworkPolicies; there are GlobalNetworkSets also. @paulgmiller If v3 is not enabled in AKS, does it mean anyone using Calico either has to use the outdated v1 (not supported) or change their network policy configuration to use Azure or Cilium for the foreseeable future?

joeybdub avatar Oct 03 '24 17:10 joeybdub

For me, I feel that it makes little sense for Microsoft to offer Calico as a Network Policy provision if you could then only use networking.k8s.io/v1. That diminishes any value in choosing Calico as far as I can see. If you blocked Calico resources I also feel you'd be making it harder for customers to secure their clusters - please do not do that.

I personally used Calico OSS for a good year or so (deploying v1 policies, supported or not) and functionally it did its job exactly as expected. What we were missing by not having v3 available was the validation it brings. With v1 policies, no validation is performed on what you deploy, and this can lead to problems which v3 might help you avoid.

@paulgmiller In answer to the question about what customers are using the Calico APIs for, you are spot on with GlobalNetworkPolicies. This will likely be the most prominent reason for using Calico, so that you can apply things like a global default deny rule. Additionally, as with Joey, I made heavy use of [Global]NetworkSets, since these are a great way of abstracting IP addresses in a self-documenting way. Customers who use these objects will then most likely also use the Calico implementation of a NetworkPolicy resource over the k8s native one. Most other resource types and the additional capability afforded in the Calico NetworkPolicy objects (like DNS based policies) are not available in the OSS version, so I doubt the list would grow beyond these 4 primary resource types.
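To make three of these resource types concrete, here is a minimal sketch using the projectcalico.org/v3 API, so it assumes the Calico API server is available in the cluster. All names, labels, namespaces, and CIDRs are hypothetical, and the list of excluded system namespaces is an assumption that would need checking against the actual cluster before applying a default deny.

```yaml
# Global default deny for every namespace except the listed system ones.
# With no rules defined, selected pods have all ingress and egress denied
# (including DNS) until other policies allow it.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny                # hypothetical name
spec:
  namespaceSelector: kubernetes.io/metadata.name not in {"kube-system", "calico-system", "calico-apiserver", "tigera-operator"}
  types:
    - Ingress
    - Egress
---
# A named, labelled group of CIDRs that policies can refer to by label.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkSet
metadata:
  name: partner-api-ranges          # hypothetical name
  labels:
    network-set: partner-api        # hypothetical label
spec:
  nets:
    - 203.0.113.0/24                # documentation-only range (RFC 5737)
---
# Calico NetworkPolicy that allows egress to whatever CIDRs the set holds.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-egress-to-partner-api # hypothetical name
  namespace: demo                   # hypothetical namespace
spec:
  selector: app == 'web'            # hypothetical label
  types:
    - Egress
  egress:
    - action: Allow
      destination:
        namespaceSelector: global() # match non-namespaced resources such as GlobalNetworkSets
        selector: network-set == 'partner-api'
```

When the partner ranges change, only the GlobalNetworkSet needs editing; the policy keeps referring to the label.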

@joeybdub I don't believe v1 resources are outdated, although I know the version number would lead you to think as much. In the commercial product your v3 policies are converted to v1 after the validation is performed. I'm not clear on the reasons why. It's just that Tigera state that v1 resources are meant to be an internal concern and not used directly by customers, as far as I understand it.

MarkTopping avatar Oct 03 '24 20:10 MarkTopping

@paulgmiller Given that the documentation explicitly mentions as Other features:

Extended policy model consisting of Global Network Policy, Global Network Set, and Host Endpoint

I think it would be very misleading if it were suddenly not supported and explicit removal of these features would impact tons of customers already using these for quite a long time now.

I realize that this might be a bit more work on the AKS side, but the best way forward would be to properly offer the features that Calico OSS comes with, and therefore include the Calico API Server and deploy the proper CRD.

jemag avatar Oct 18 '24 20:10 jemag

Just wanted to quickly mention as well that these extended features of Calico have been mentioned as far back as the original announcement about Calico support: https://azure.microsoft.com/en-us/blog/integrating-azure-cni-and-calico-a-technical-deep-dive/ . In contrast, I cannot find any mention anywhere that would suggest these Calico features would not be supported.

As for what customers use it for, it is not only for Global Network Policy, Global Network Set, and Host Endpoint. Calico's policies also have significant advantages over regular network policies such as Ordering and support for both Allow and Deny actions.

In my opinion, supporting Calico for network policies without supporting any of their features (e.g. only regular kubernetes network policies) would be nonsensical.

jemag avatar Oct 18 '24 22:10 jemag

I want to take a moment to clarify some of the terminology to ensure we are aligned on the specifics.

  1. “Support”: When we say AKS doesn’t “support” a feature, this doesn’t mean it’s impossible to use. Specifically, we do not and are not planning to explicitly block Calico policies. As you pointed out, these policies are integral to many customer environments. What we do mean is that if an issue arises with policies outside the scope of Kubernetes network policies, Microsoft Support would not be able to assist. We do not “officially” support Calico functionalities beyond basic network policies and cannot guarantee their stability on AKS, although that doesn’t mean they won’t work.
  2. Improved Functionality: We recognize the need for more advanced networking capabilities, which is why we are focusing on expanding these features with Cilium. While Calico was originally intended for basic network policies, we are actively developing enhanced capabilities using Cilium. This shift allows us to offer better performance, scalability, and deeper integration with AKS via eBPF technology, supported by an internal team at Microsoft. Unfortunately, the decision to use Calico in its current scope predates my involvement, but we are committed to meeting the growing demands of our customers.
  3. Timelines: In the short-to-medium term, Calico will continue to be available and supported within its existing boundaries. We have no plans to disrupt current usage suddenly. However, we encourage customers to start evaluating our Cilium offering, which we believe provides significant improvements in performance and scalability, along with a more seamless integration with AKS.

We’d also love to hear more about which specific Calico features you depend on. Gathering this feedback will help us prioritize our work and ensure our advanced networking capabilities meet your needs. Our goal is to ensure AKS continues to improve, with the confidence that your networking stack is both stable and fully supported.

Please feel free to reach out with any additional questions or feature requests. I’d be happy to create tracking items for you to follow and contribute to as we evolve the platform.

chasewilson avatar Oct 19 '24 03:10 chasewilson

  1. “Support”: When we say AKS doesn’t “support” a feature, this doesn’t mean it’s impossible to use. Specifically, we do not and are not planning to explicitly block Calico policies. As you pointed out, these policies are integral to many customer environments. What we do mean is that if an issue arises with policies outside the scope of Kubernetes network policies, Microsoft Support would not be able to assist. We do not “officially” support Calico functionalities beyond basic network policies and cannot guarantee their stability on AKS, although that doesn’t mean they won’t work

@chasewilson I think there are 2 issues here:

  1. As mentioned, both the original announcement and the documentation mention these extended features. Customers are not privy to the internal workings of the AKS team. If truly these features are not supported, it should then be clearly and explicitly mentioned what the boundaries of that support are.
  2. It is fair for support to not be able to help with Calico network policies or custom resources. However, as was mentioned, customers are already using these features and, at the very least, there should be a minimum guarantee that AKS Calico integration is stable and will not break these features. If you cannot guarantee their stability, what that means is that any day an update to the current Calico setup could break thousands of clusters across your various customers. This would not be okay.

In my opinion, I see only 2 reasonable options moving forward:

A- (Recommended) Make a clear statement in the documentation deprecating explicit support for any Calico custom resources, while ensuring the minimum stability that AKS Calico implementation will not break these features (and therefore existing customers). In my mind, this would also mean providing the API server resource and proper CRD definition.

B- Make an announcement and provide a transition path and period from Azure CNI + AKS managed Calico to Azure CNI + self-managed Calico OSS for network policies. This would mean ensuring that Azure CNI can still be used with Calico for network policies and there is an easy path for customers to set it up and maintain the install themselves over time.

I would much prefer option A, as this would keep the status quo and what customers were led to believe was the current state of affairs. It would also avoid breaking current customer setups, spare customers an arduous transition period, and allow them to keep benefiting from Calico's mature network policy model.


In regards to Calico features that we depend on, here are some of them for us:

  • NetworkSets and GlobalNetworkSets
    • If you have countless rules targeting a specific group of CIDRs and these CIDRs change, with GlobalNetworkSets they can be changed easily in a central location. Otherwise this can quickly become an unmaintainable mess. It also helps when you have groups of CIDRs that represent specific concepts and only change from one environment to another (again, easy centralization of change).
  • GlobalNetworkPolicy
    • Easily set up policies across several namespaces
  • Explicit Deny and Allow action
    • It can sometimes be much easier to deny a very specific network flow than allow a large list of them
  • Policy Order
    • Ordering can be very important when using both Deny and Allow actions. For example, you may have a sub-section/CIDR of traffic being explicitly denied and need to allow a specific exception within that. With ordering you can "override" that deny with this specific exception. It can also help when sharing responsibility of network policies across cluster administrators and development teams (e.g. : cluster administrators could have permissions over Calico network policies and development team over regular Kubernetes network policies. Administrators can then create policies that can either be overridden by Kubernetes network policies (high order), or not overridable (low order))
  • notNets
    • Makes it possible to exclude specific CIDRs from being allowed
  • Seamless Wireguard encryption across nodes:
    • Enabling encryption between nodes is as simple as changing the parameter wireguardEnabled from false to true. This provides a very seamless and easy experience.
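Several of the features in this list can be sketched together. The manifests below are an illustration only, assuming the projectcalico.org/v3 API is available; every name, label, order value, and CIDR is hypothetical. In Calico, a lower `order` value is evaluated first, which is what lets the narrow exception win over the broader deny.

```yaml
# Evaluated first (order 100): a narrow exception for one workload.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-billing-exception     # hypothetical name
spec:
  order: 100
  selector: tier == 'backend' && app == 'billing'   # hypothetical labels
  types:
    - Egress
  egress:
    - action: Allow
      destination:
        nets:
          - 198.51.100.10/32        # single host inside the denied range
---
# Evaluated second (order 200): explicit Deny plus an Allow using notNets.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: backend-egress-baseline     # hypothetical name
spec:
  order: 200
  selector: tier == 'backend'
  types:
    - Egress
  egress:
    # Explicitly deny one range.
    - action: Deny
      destination:
        nets:
          - 198.51.100.0/24
    # Allow everything else, except a second range excluded through notNets.
    # Traffic to the excluded range is simply not matched by this rule.
    - action: Allow
      destination:
        notNets:
          - 192.0.2.0/24
---
# WireGuard: the single field mentioned above, shown as a fragment of the
# existing "default" FelixConfiguration (an assumption: in practice this is
# usually patched onto the existing resource rather than created fresh).
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  wireguardEnabled: true
```

None of the deny, ordering, or notNets behaviour can be expressed with networking.k8s.io/v1 policies, which is the gap this list is describing.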

Unless Cilium can also provide all of the above features as well as a straightforward transition path, we cannot really consider it.

jemag avatar Oct 19 '24 15:10 jemag

I did a quick test with a K8s 1.28 cluster (I used this version because that is what I had at hand) and was able to add this resource to my cluster, which resulted in the calico-apiserver being deployed.

apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

AKS has loaded the calico-apiserver images into the registry it configures, so the tigera/operator in the cluster can respond to the addition of the APIServer resource and deploy the API server, which then makes the Calico v3 API available.

I hope this helps anyone who is looking to use the Calico v3 API instead of v1, which, as previously stated, should only be used directly by Calico components or with the calicoctl command.

tmjd avatar Oct 22 '24 12:10 tmjd

Talking with some tigera people about this. AKS still doesn't want to over extend AKS to claim we support something when all we really test is k8s netpol.

paulgmiller avatar Jan 22 '25 19:01 paulgmiller

@paulgmiller I think it is fair to clarify the level at which Calico is being tested in this GitHub issue. However, the current documentation and original blog posts lead in the other direction, implying that these features are supported.

I think, in the best case scenario, this is either indeed tested by the AKS team OR an agreement is made such that AKS team relies on Tigera for that part of the testing. My understanding is that Tigera already provide their enterprise product to be available on top of Azure CNI, so it is in their best interest to keep these Calico features working.

On our side, we're now simply deploying the Tigera operator with APIServer ourselves on top of Azure CNI, but we would much rather go back to Azure's Calico integration, should it properly support Calico's APIServer and other features out of the box.

jemag avatar Jan 23 '25 02:01 jemag

Will you include the Calico API Server in next releases? We have installed the calico apiserver following the steps from Calico and it pulls the image from mcr.microsoft.com/oss/calico/apiserver:v3.28.3

kriskron avatar Apr 30 '25 16:04 kriskron

The latest image is not found in MCR; getting ImagePullBackOff for mcr.microsoft.com/mirror/quay/calico/apiserver:v3.29.3

andrewkreuzer avatar May 08 '25 14:05 andrewkreuzer

@paulgmiller could you help get the apiserver image added to the mcr.microsoft.com/mirror for folks that want to add the APIserver? Ideally it will be added to the process of mirroring the Calico images.

tmjd avatar May 09 '25 20:05 tmjd

thanks @tmjd there is now an issue specific to the image not being found #4999

andrewkreuzer avatar May 09 '25 20:05 andrewkreuzer

I did a manual test of the API server on 1.32 and 1.30 and the API server pods came up, but that's a manual one-off test and there is no guard to make sure it continues to work in our deployments.

If there were a more serious issue than a missing image, we can't make any promises about fixing it in a timely fashion. Your supported options are still the Calico we deploy without the API server. I recognize that doesn't include a bunch of features, but those are features we never intended to support, and we would direct you to work with Tigera enterprise/BYO CNI if you need them and can't find them here.

paulgmiller avatar May 21 '25 16:05 paulgmiller

I recognize that doesn't include a bunch of features but those are features we never intended to support

As mentioned in previous posts, this is unfortunately not the way it was communicated by Azure in the past.

would direct you to work with tigera enterprise/byo cni if you need them and can't find them here.

You do not need tigera enterprise nor BYO CNI to use those features.


I would have preferred to have Azure support (since this is fairly straightforward), but at this point I would just recommend that other users go with Azure CNI + no network policy and then install the tigera-operator Helm chart directly (https://artifacthub.io/packages/helm/projectcalico/tigera-operator).

This is a pretty straightforward install. We have been running this setup for months across multiple clusters and the experience has been quite stable (perhaps more so than what is currently supported by Azure).
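As a rough sketch of what that self-managed setup configures, the operator is driven by two custom resources. The Installation field values below are assumptions based on the operator.tigera.io/v1 API rather than details given in this thread, so check them against the Calico documentation for the version you install. The APIServer resource is the same one shown earlier in the thread.

```yaml
# Tells the operator how to install Calico on this cluster.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  kubernetesProvider: AKS           # assumption: provider hint for AKS clusters
  cni:
    type: AzureVNET                 # assumption: keep Azure CNI for networking, Calico for policy only
---
# Asks the operator to deploy the Calico API server, which serves projectcalico.org/v3.
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
```

When installing through the Helm chart, the same settings would typically be supplied through chart values rather than applied by hand.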

You will also get all the advanced Calico features, which, as of version 3.30, include a free observability UI for network flows: https://www.tigera.io/blog/introducing-calico-3-30-a-new-era-of-open-source-network-security-and-observability-for-kubernetes/

I still think this should be supported out of the box as part of the Calico integration, given how large the gap is between the features currently offered and those of the full install. Nevertheless, in the meantime users can go with the self-managed tigera-operator install like we did.

jemag avatar May 21 '25 21:05 jemag

For AKS on Azure Local it seems like Calico is the only supported configuration; based on my proof-of-concept testing, "no network policy" is not even available.

(I realize AKS on Azure Local / ARC is perhaps internally far away from regular AKS - but I had the same questions about crd.projectcalico.org/v3 and the features released with Calico 3.30 and found this thread.)

filleokus avatar Jun 11 '25 17:06 filleokus