linkerd2
LinkerD does not work with weave-net cni on eks cluster
Bug Report
What is the issue?
When Linkerd is installed on an EKS cluster that uses the Weave Net CNI instead of the AWS CNI, Linkerd is not able to inject proxies into deployments, and tap does not work.
How can it be reproduced?
1. Deploy new EKS cluster @ version 1.15
eksctl create cluster
2. Delete AWS CNI
kubectl delete ds aws-node -n kube-system
3. Deploy Weave Net
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
4. Initiate deployment of new worker nodes
5. Deploy Linkerd
linkerd install | kubectl apply -f -
6. Deploy test app with proxy injection enabled
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -
Logs, error output, etc
E0505 18:40:13.084785 1 controller.go:114] loading OpenAPI spec for "v1alpha1.tap.linkerd.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0505 18:40:13.084791 1 controller.go:127] OpenAPI AggregationController: action for item v1alpha1.tap.linkerd.io: Rate Limited Requeue.
E0505 18:40:41.431426 1 available_controller.go:409] v1alpha1.tap.linkerd.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.tap.linkerd.io": the object has been modified; please apply your changes to the latest version and try again
E0505 18:40:41.431843 1 available_controller.go:409] v1alpha1.tap.linkerd.io failed with: failing or missing response from https://10.32.0.3:8089/apis/tap.linkerd.io/v1alpha1: Get https://10.32.0.3:8089/apis/tap.linkerd.io/v1alpha1: Address is not allowed
linkerd check output
linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust roots are using supported crypto algorithm
√ trust roots are within their validity period
√ trust roots are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust root
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
‼ tap api service is running
FailedDiscoveryCheck: failing or missing response from https://10.38.0.6:8089/apis/tap.linkerd.io/v1alpha1: Get https://10.38.0.6:8089/apis/tap.linkerd.io/v1alpha1: Address is not allowed
see https://linkerd.io/checks/#l5d-tap-api for hints
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
Status check results are √
Environment
- Kubernetes Version: v1.15.11-eks-af3caf
- Cluster Environment: EKS
- Host OS: amazon-eks-node-1.15-v20200423 (ami-026522559b4f79cc8)
- Linkerd version: 2.7.1
Possible solution
It seems the master nodes are unable to communicate with the Linkerd API server, since these nodes don't participate in the Weave Net overlay network. I've seen other technologies work around this by using the hostNetwork flag on their deployments. I tried testing this with the Linkerd deployment, but only bad things happened.
Additional context
Master nodes are on a 10.200.x.x network; Linkerd is on the default Weave Net network, 10.32.x.x.
Tap will also not work. In fact, most k8s projects either using webhooks or the aggregation layer won't work (but ymmv).
It is possible to use a URL instead of the service reference we're using now. The best solution might be to use a nodePort in the proxy-injector's service. You could then either front all your EKS nodes with a load balancer pointing to that nodePort or have a subdomain with all your nodes' IP addresses in it. I would recommend using the kustomize method to modify your installation.
PS. I would recommend bringing this up with the EKS team as it is a limitation on their side of things.
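For anyone trying the nodePort route with kustomize, here is a minimal sketch of what the patch could look like. It assumes the rendered manifest was saved with `linkerd install > linkerd.yaml`, that the Service is named `linkerd-proxy-injector` in the `linkerd` namespace and listens on port 443 (check your own rendered YAML), and that 30443 is a free port in your cluster's nodePort range. The file names and the port number are made up for illustration.

```yaml
# ----- file: kustomization.yaml (hypothetical file name) -----
resources:
- linkerd.yaml                      # assumed: output of `linkerd install > linkerd.yaml`
patchesStrategicMerge:
- proxy-injector-nodeport.yaml

# ----- file: proxy-injector-nodeport.yaml (hypothetical file name) -----
apiVersion: v1
kind: Service
metadata:
  name: linkerd-proxy-injector      # assumed name, verify against linkerd.yaml
  namespace: linkerd
spec:
  type: NodePort                    # switch from the default ClusterIP
  ports:
  - port: 443                       # Service ports are merged by `port`, so this must match the existing entry
    nodePort: 30443                 # illustrative value in the default 30000-32767 range
```

The two sections are separate files. `kubectl kustomize . | kubectl apply -f -` would render and apply the result. This only exposes the service on the nodes; the webhook configuration still has to be pointed at the new address separately.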
If we did want to go the route of using a load balancer pointing to a nodePort, where in the config would we tell the master nodes what the new URL would be?
Also, we have a case open with AWS support. So far their response is... just use the AWS CNI.
You'd change the MutatingWebhookConfiguration (MWC) to use an IP address and change the proxy injector service. There's no Linkerd-side config for this, so you'll be editing the YAML directly.
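As a rough sketch of that edit: in the Kubernetes API, a webhook's `clientConfig` takes either a `service` reference or a `url`, not both, so the change amounts to swapping one for the other. The fragment below assumes a Kubernetes 1.15 cluster (hence `v1beta1`) and assumes the webhook entry is named `linkerd-proxy-injector.linkerd.io`; confirm that with `kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config -o yaml`. The host name and port in the URL are hypothetical placeholders.

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: linkerd-proxy-injector-webhook-config
webhooks:
- name: linkerd-proxy-injector.linkerd.io   # assumed webhook name, verify on your cluster
  clientConfig:
    # The `service:` stanza is removed; `url` and `service` are mutually exclusive.
    # The URL must use https and must be reachable from the EKS control plane.
    url: https://injector.example.internal:30443/   # hypothetical host and port
    # Keep the existing caBundle value as-is. The API server verifies TLS against
    # the host in the URL, so the webhook's serving certificate has to be valid
    # for that host.
    caBundle: <existing value, unchanged>
  # ...remaining fields (rules, failurePolicy, etc.) stay as generated
```

This is a fragment, not a complete object. Re-running `linkerd install` or `linkerd upgrade` would likely regenerate the original `service` reference, so the change would need to be reapplied or carried as a kustomize patch.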
Would it be possible to document all the Linkerd webhooks somewhere? From what I found, patching only the proxy-injector is not enough; for example, the sp-validator needs patching as well.
@Fantaztig Grepping the templates is probably the best way for now. E.g.
:; linkerd install --ignore-cluster | grep WebhookConfig -A2
kind: ValidatingWebhookConfiguration
metadata:
name: linkerd-sp-validator-webhook-config
--
kind: ValidatingWebhookConfiguration
metadata:
name: linkerd-policy-validator-webhook-config
--
kind: MutatingWebhookConfiguration
metadata:
name: linkerd-proxy-injector-webhook-config
We'd accept a website PR if you'd like to see this there as well.
We're also hitting this issue when trying to build up new clusters with EKS and Cilium CNI. Given that it's unlikely that Amazon will improve the situation I'm trying to find a solution which is at least reliable.
Simon Rajan on the Linkerd slack suggested ingress as a solution:
The k8s API server is not able to reach the endpoint linkerd-proxy-injector.linkerd.svc, which is defined inside the MutatingWebhookConfiguration, while using Calico as the CNI. I don't think it's an issue with Linkerd; it's an issue with EKS running with the Calico CNI. As a workaround, you can create an ingress endpoint for linkerd-proxy-injector.linkerd.svc and update the MutatingWebhookConfiguration linkerd-proxy-injector-webhook-config to use a url in clientConfig rather than a service endpoint, like this: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#url
Mutating webhook:
kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config -o yaml
For discoverability there is also this closed issue: https://github.com/linkerd/linkerd2/issues/5576