
LinkerD does not work with weave-net cni on eks cluster

Open TechnoChimp opened this issue 4 years ago • 6 comments

Bug Report

What is the issue?

When Linkerd is installed on an EKS cluster that uses the Weave Net CNI instead of the AWS CNI, Linkerd cannot inject proxies into deployments, and tap does not work.

How can it be reproduced?

1. Deploy new EKS cluster @ version 1.15
eksctl create cluster

2. Delete AWS CNI
kubectl delete ds aws-node -n kube-system

3. Deploy Weave Net
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

4. Initiate deployment of new worker nodes

5. Deploy Linkerd
linkerd install | kubectl apply -f -

6. Deploy test app with proxy injection enabled
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -

Logs, error output, etc

E0505 18:40:13.084785       1 controller.go:114] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]

I0505 18:40:13.084791       1 controller.go:127] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.

E0505 18:40:41.431426       1 available_controller.go:409] v1alpha1.tap.linkerd.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.tap.linkerd.io": the object has been modified; please apply your changes to the latest version and try again

E0505 18:40:41.431843       1 available_controller.go:409] v1alpha1.tap.linkerd.io failed with: failing or missing response from https://10.32.0.3:8089/apis/tap.linkerd.io/v1alpha1: Get https://10.32.0.3:8089/apis/tap.linkerd.io/v1alpha1: Address is not allowed

linkerd check output

linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust roots are using supported crypto algorithm
√ trust roots are within their validity period
√ trust roots are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust root

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
‼ tap api service is running
    FailedDiscoveryCheck: failing or missing response from https://10.38.0.6:8089/apis/tap.linkerd.io/v1alpha1: Get https://10.38.0.6:8089/apis/tap.linkerd.io/v1alpha1: Address is not allowed
    see https://linkerd.io/checks/#l5d-tap-api for hints

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.15.11-eks-af3caf
  • Cluster Environment: EKS
  • Host OS: amazon-eks-node-1.15-v20200423 (ami-026522559b4f79cc8)
  • Linkerd version: 2.7.1

Possible solution

It seems the master nodes are unable to communicate with the Linkerd API server because they don't participate in the Weave Net overlay network. I've seen other technologies work around this by setting the hostNetwork flag on their deployments. I tried this with the Linkerd deployment, but only bad things happened.

Additional context

Master nodes are on a 10.200.x.x network; Linkerd is on the 10.32.x.x Weave Net default network.

TechnoChimp avatar May 05 '20 19:05 TechnoChimp

Tap will also not work. In fact, most k8s projects that use either webhooks or the aggregation layer won't work (but ymmv).

It is possible to use a URL instead of the service reference we're using now. The best solution might be to use a nodePort in the proxy-injector's service. You could then either front all your EKS nodes with a load balancer pointing to that nodePort or have a subdomain with all your nodes' IP addresses in it. I would recommend using the kustomize method to modify your installation.

PS. I would recommend bringing this up with the EKS team as it is a limitation on their side of things.

grampelberg avatar May 05 '20 20:05 grampelberg

If we did want to go the route of using a load balancer pointing to a nodePort, where in the config would we tell the master nodes what the new URL would be?

Also, we have a case open with AWS support. So far their response is... just use the AWS CNI.

TechnoChimp avatar May 11 '20 16:05 TechnoChimp

You'd change the MWC (MutatingWebhookConfiguration) to use an IP address and change the proxy injector service. There's no Linkerd-side config for this, so you'll be editing the YAML directly.
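
For anyone following along, the webhook-configuration half of that edit could be scripted roughly as below. This is only a sketch under stated assumptions: the helper name webhook_url_patch, the example URL, and the guess that the proxy injector is entry 0 in the webhooks list are illustrative and do not come from Linkerd's docs. The function only prints a JSON patch that removes clientConfig.service and adds clientConfig.url (Kubernetes accepts one or the other, not both); nothing is applied to the cluster.

```shell
# Sketch: print a JSON patch that switches one webhook entry from a
# service reference to a URL. Nothing is applied to the cluster here.
#   $1 = index of the entry in .webhooks (assumed to be 0 here)
#   $2 = https URL the API server should call (hypothetical; must not contain quotes)
webhook_url_patch() {
  printf '[{"op":"remove","path":"/webhooks/%s/clientConfig/service"},' "$1"
  printf '{"op":"add","path":"/webhooks/%s/clientConfig/url","value":"%s"}]\n' "$1" "$2"
}

# Example invocation (the URL is a placeholder for your load balancer or node address):
# kubectl patch mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config \
#   --type=json -p "$(webhook_url_patch 0 https://injector.example.com:30443/)"
```

Note that the certificate served at that URL still has to verify against the caBundle in the webhook configuration and be valid for the new host name or IP, so the webhook certificates will likely need adjusting too. The proxy injector Service change (for example, to type NodePort) is a separate edit.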

grampelberg avatar May 11 '20 17:05 grampelberg

Would it be possible to document all the Linkerd webhooks somewhere? From what I found, patching only the proxy-injector is not enough; e.g., the sp-validator also needs to be patched.

Fantaztig avatar Jun 13 '22 11:06 Fantaztig

@Fantaztig Grepping the templates is probably the best way for now. E.g.

:; linkerd install --ignore-cluster | grep WebhookConfig -A2
kind: ValidatingWebhookConfiguration
metadata:
  name: linkerd-sp-validator-webhook-config
--
kind: ValidatingWebhookConfiguration
metadata:
  name: linkerd-policy-validator-webhook-config
--
kind: MutatingWebhookConfiguration
metadata:
  name: linkerd-proxy-injector-webhook-config

We'd accept a website PR if you'd like to see this there as well.
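
A slightly more script-friendly variant of that grep, as a rough sketch: the list_webhook_configs helper is made up for illustration, and it assumes each rendered manifest has its metadata name on a line indented by two spaces, as in the output above. It reads manifests on stdin and prints one Kind/name pair per webhook configuration.

```shell
# Sketch: read rendered manifests on stdin and print "Kind/name" for every
# webhook configuration found. Assumes metadata.name is indented by two spaces.
list_webhook_configs() {
  awk '
    /^kind: (Mutating|Validating)WebhookConfiguration$/ { kind = $2; next }
    /^---/ { kind = "" }   # new YAML document: forget the previous kind
    kind != "" && /^  name: / { print kind "/" $2; kind = "" }
  '
}

# Example invocation:
# linkerd install --ignore-cluster | list_webhook_configs
```

On a live cluster, something like kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name | grep linkerd should give a comparable list.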

olix0r avatar Jun 14 '22 16:06 olix0r

We're also hitting this issue when trying to build new clusters with EKS and the Cilium CNI. Given that it's unlikely that Amazon will improve the situation, I'm trying to find a solution that is at least reliable.

Simon Rajan on the Linkerd slack suggested ingress as a solution:

The k8s API server is not able to reach the endpoint linkerd-proxy-injector.linkerd.svc, which is defined inside the mutatingwebhookconfiguration, while using Calico as the CNI. I don't think it's an issue with Linkerd; it's an issue with EKS running with the Calico CNI.

As a workaround, you can create an ingress endpoint for linkerd-proxy-injector.linkerd.svc and update the mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config to use a url in clientConfig rather than a service endpoint, like this: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#url

Mutating webhook:
kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config -o yaml

For discoverability there is also this closed issue: https://github.com/linkerd/linkerd2/issues/5576

cablespaghetti avatar Aug 16 '23 15:08 cablespaghetti