Linkerd does not inject proxy containers with custom CNI on AWS
What is the issue?
Linkerd proxy injection does not work with custom CNI (cilium) on AWS EKS clusters.
How can it be reproduced?
Install cilium
helm list -n kube-system | grep cilium
cilium kube-system 4 2024-04-19 12:19:50.727550183 +0000 UTC deployed cilium-1.15.4 1.15.4
helm get values cilium -n kube-system
USER-SUPPLIED VALUES:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cni-plugin
          operator: NotIn
          values:
          - aws
egressMasqueradeInterfaces: eth0
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
    - 10.0.0.0/8
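For reference, a release matching the helm list output above could have been installed with something like the following (a sketch, not necessarily the exact commands used; the values file name is an assumption):
helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium --version 1.15.4 -n kube-system -f cilium-values.yaml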
Install linkerd
helm list -n linkerd
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
linkerd-control-plane linkerd 1 2024-04-22 14:00:07.29744245 +0000 UTC deployed linkerd-control-plane-2024.3.5 edge-24.3.5
linkerd-crds linkerd 5 2024-04-04 11:06:36.932480898 +0000 UTC deployed linkerd-crds-2024.3.5
helm get values linkerd-control-plane -n linkerd
USER-SUPPLIED VALUES:
disableHeartBeat: true
identity:
  issuer:
    scheme: kubernetes.io/tls
identityTrustAnchorsPEM: |-
  -----BEGIN CERTIFICATE-----
  $CERT_CONTENT
  -----END CERTIFICATE-----
linkerdVersion: edge-24.3.5
policyController:
  image:
    name: my-artifactory/ghcr-docker-remote/linkerd/policy-controller
    version: edge-24.3.5
profileValidator:
  externalSecret: false
proxy:
  image:
    name: my-artifactory/ghcr-docker-remote/linkerd/proxy
    version: edge-24.3.5
  resources:
    cpu:
      limit: 100m
      request: 50m
    memory:
      limit: 100Mi
      request: 40Mi
proxyInit:
  image:
    name: my-artifactory/ghcr-docker-remote/linkerd/proxy-init
    version: v2.2.4
  runAsRoot: false
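For reference, these values would typically be applied with something like the following (a sketch; the edge Helm repo alias and values file name are assumptions):
helm repo add linkerd-edge https://helm.linkerd.io/edge
helm upgrade --install linkerd-crds linkerd-edge/linkerd-crds -n linkerd --create-namespace
helm upgrade --install linkerd-control-plane linkerd-edge/linkerd-control-plane -n linkerd -f linkerd-values.yaml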
Annotate the namespace for automatic injection
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    config.linkerd.io/proxy-await: enabled
    linkerd.io/inject: enabled
...
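The same annotations can also be applied to an existing namespace imperatively, for example:
kubectl annotate ns goldilocks linkerd.io/inject=enabled config.linkerd.io/proxy-await=enabled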
Delete the pods
k get pod -n goldilocks
NAME READY STATUS RESTARTS AGE
goldilocks-controller-7869c48649-nqwkl 1/1 Running 0 50m
goldilocks-dashboard-75df58d594-49cj2 1/1 Running 0 50m
goldilocks-dashboard-75df58d594-zgw6v 1/1 Running 0 50m
user@ip-10-x-x-65 ~ $ k delete pod --all -n goldilocks
pod "goldilocks-controller-7869c48649-nqwkl" deleted
pod "goldilocks-dashboard-75df58d594-49cj2" deleted
pod "goldilocks-dashboard-75df58d594-zgw6v" deleted
user@ip-10-x-x-65 ~ $ k get pod -n goldilocks
NAME READY STATUS RESTARTS AGE
goldilocks-controller-7869c48649-vq5g2 1/1 Running 0 8s
goldilocks-dashboard-75df58d594-jdrnm 1/1 Running 0 6s
goldilocks-dashboard-75df58d594-ppxjm 1/1 Running 0 8s
The sidecar proxy should be injected, and the output above should instead look like this:
k get pod -n goldilocks
NAME READY STATUS RESTARTS AGE
goldilocks-controller-7869c48649-vq5g2 2/2 Running 0 8s
goldilocks-dashboard-75df58d594-jdrnm 2/2 Running 0 6s
goldilocks-dashboard-75df58d594-ppxjm 2/2 Running 0 8s
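As an aside, a rollout restart of the deployments is an equivalent way to recreate the pods and trigger injection:
kubectl rollout restart deploy -n goldilocks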
Logs, error output, etc
https://gist.github.com/gabbler97/6734dc908cf7136df49a8d2ba5e67eb9
output of linkerd check -o short
linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2024-04-25T05:51:39Z
see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.13.4 but the latest stable version is 2.14.10
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 24.3.5 but the latest edge version is 24.4.4
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-24.3.5 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-c6595f85b-b9tlz (edge-24.3.5)
* linkerd-identity-6bfcf4bf97-cr8km (edge-24.3.5)
* linkerd-proxy-injector-59d7d485b-crbgj (edge-24.3.5)
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-c6595f85b-b9tlz running edge-24.3.5 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints
linkerd-viz
-----------
‼ linkerd-viz pods are injected
could not find proxy container for metrics-api-5bd869c749-6vqmt pod
see https://linkerd.io/2.13/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
container "linkerd-proxy" in pod "metrics-api-5bd869c749-6vqmt" is not ready
see https://linkerd.io/2.13/checks/#l5d-viz-pods-running for hints
‼ viz extension proxies are healthy
no "linkerd-proxy" containers found in the "linkerd" namespace
see https://linkerd.io/2.13/checks/#l5d-viz-proxy-healthy for hints
Status check results are √
Environment
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.11-eks-b9c9ed7
Possible solution
No response
Additional context
I have tried running the linkerd-proxy-injector with hostNetwork=true (a generic way to apply this is sketched after the logs below). In this case the proxy sidecar containers are injected automatically after a deployment rollout. However, some nodes became NotReady because the kubelet stopped posting status; after about 10 minutes this resolved itself automatically. My pods that interact with the kube API server started to CrashLoopBackOff, but only on one specific node at a time (the node where the linkerd-proxy-injector pod was running):
k get pod -A -o wide | grep "("
backup node-agent-dh6tm 0/1 CrashLoopBackOff 6 (43s ago) 10m 172.24.2.7 ip-10-x-x-162 <none> <none>
monitoring datadog-jwgx6 3/4 Running 5 (73s ago) 10m 172.24.2.83 ip-10-x-x-162 <none> <none>
storage-ebs ebs-csi-node-sfkx6 1/3 CrashLoopBackOff 8 (21s ago) 4m25s 172.24.2.163 ip-10-x-x-162 <none> <none>
storage-fsx fsx-openzfs-csi-node-2tkw7 1/3 CrashLoopBackOff 12 (71s ago) 9m25s 172.24.2.251 ip-10-x-x-162 <none> <none>
Inside the pod logs I found timeouts for API server requests:
k logs ebs-csi-node-775zv -n storage-ebs
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0405 08:52:12.308665 1 driver.go:83] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.28.0"
I0405 08:52:12.308784 1 node.go:93] "regionFromSession Node service" region="eu-central-1"
I0405 08:52:12.308809 1 metadata.go:85] "retrieving instance data from ec2 metadata"
I0405 08:52:24.870306 1 metadata.go:88] "ec2 metadata is not available"
I0405 08:52:24.870333 1 metadata.go:96] "retrieving instance data from kubernetes api"
I0405 08:52:24.871040 1 metadata.go:101] "kubernetes api is available"
panic: error getting Node ip-10-x-x-77.eu-central-1.compute.internal: Get "https://172.20.0.1:443/api/v1/nodes/ip-10-x-x-77": dial tcp 172.20.0.1:443: i/o timeout
goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc00041cfc0)
/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:96 +0x3b1
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver({0xc000477ec0, 0xd, 0x4?})
/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:106 +0x3e6
main.main()
/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:64 +0x595
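For completeness, one generic way to run the hostNetwork=true experiment mentioned above (a sketch; not necessarily how it was applied here) is to patch the injector deployment directly:
kubectl patch deploy linkerd-proxy-injector -n linkerd -p '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'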
Would you like to work on fixing this bug?
None
Before attempting to use host networking, can you post the events (kubectl describe) for the deployments (not the pods) after rolling them out, to see if there's any info about why they didn't get injected? Also, the events for the injector pod and its logs might prove useful.
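For example, a minimal set of commands along those lines (the goldilocks namespace is taken from the report above) would be:
kubectl describe deploy -n goldilocks
kubectl describe pod -n linkerd -l linkerd.io/control-plane-component=proxy-injector
kubectl logs -n linkerd deploy/linkerd-proxy-injector -c proxy-injector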
Thank you for your answer, @alpeb!
user@ip-10-x-x-65 ~ $ k logs linkerd-proxy-injector-55f86f4fc9-tsmgc -n linkerd
Defaulted container "linkerd-proxy" out of: linkerd-proxy, proxy-injector, linkerd-init (init)
[ 0.095648s] INFO ThreadId(01) linkerd2_proxy: release 2.224.0 (d91421a) by linkerd on 2024-03-28T18:07:05Z
[ 0.099989s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.101281s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.101298s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.101302s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.101305s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.101309s] INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.101312s] INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.101315s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.104250s] INFO ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_pool_p2c: Adding endpoint addr=10.0.2.118:8090
[ 0.195414s] INFO ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_pool_p2c: Adding endpoint addr=10.0.2.118:8086
[ 0.202508s] INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.0.31.152:8080
[ 0.315761s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
user@ip-10-x-x-65 ~ $ k logs linkerd-proxy-injector-55f86f4fc9-tsmgc -n linkerd -c proxy-injector
time="2024-04-25T11:25:20Z" level=info msg="running version edge-24.3.5"
time="2024-04-25T11:25:20Z" level=info msg="starting admin server on :9995"
time="2024-04-25T11:25:20Z" level=info msg="waiting for caches to sync"
time="2024-04-25T11:25:20Z" level=info msg="listening at :8443"
time="2024-04-25T11:25:20Z" level=info msg="caches synced"
user@ip-10-x-x-65 ~ $ k logs linkerd-proxy-injector-55f86f4fc9-tsmgc -n linkerd -c linkerd-init
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-04-25T11:25:12Z" level=info msg="# Generated by iptables-save v1.8.10 on Thu Apr 25 11:25:12 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n# Completed on Thu Apr 25 11:25:12 2024\n"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_REDIRECT"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp -j REDIRECT --to-port 4143 -m comment --comment proxy-init/redirect-all-incoming-to-proxy-port/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PREROUTING -j PROXY_INIT_REDIRECT -m comment --comment proxy-init/install-proxy-init-prerouting/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_OUTPUT"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -j RETURN -m comment --comment proxy-init/ignore-proxy-user-id/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -o lo -j RETURN -m comment --comment proxy-init/ignore-loopback/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp --match multiport --dports 443,6443 -j RETURN -m comment --comment proxy-init/ignore-port-443,6443/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp -j REDIRECT --to-port 4140 -m comment --comment proxy-init/redirect-all-outgoing-to-proxy-port/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A OUTPUT -j PROXY_INIT_OUTPUT -m comment --comment proxy-init/install-proxy-init-output/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-04-25T11:25:12Z" level=info msg="# Generated by iptables-save v1.8.10 on Thu Apr 25 11:25:12 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\n:PROXY_INIT_OUTPUT - [0:0]\n:PROXY_INIT_REDIRECT - [0:0]\n-A PREROUTING -m comment --comment \"proxy-init/install-proxy-init-prerouting/1714044312\" -j PROXY_INIT_REDIRECT\n-A OUTPUT -m comment --comment \"proxy-init/install-proxy-init-output/1714044312\" -j PROXY_INIT_OUTPUT\n-A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -m comment --comment \"proxy-init/ignore-proxy-user-id/1714044312\" -j RETURN\n-A PROXY_INIT_OUTPUT -o lo -m comment --comment \"proxy-init/ignore-loopback/1714044312\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m multiport --dports 443,6443 -m comment --comment \"proxy-init/ignore-port-443,6443/1714044312\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m comment --comment \"proxy-init/redirect-all-outgoing-to-proxy-port/1714044312\" -j REDIRECT --to-ports 4140\n-A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4190,4191,4567,4568 -m comment --comment \"proxy-init/ignore-port-4190,4191,4567,4568/1714044312\" -j RETURN\n-A PROXY_INIT_REDIRECT -p tcp -m comment --comment \"proxy-init/redirect-all-incoming-to-proxy-port/1714044312\" -j REDIRECT --to-ports 4143\nCOMMIT\n# Completed on Thu Apr 25 11:25:12 2024\n"
And the events for the deployments
user@ip-10-x-x-65 ~ $ k describe deploy -n linkerd | grep Events
Events: <none>
Events: <none>
Events: <none>
user@ip-10-x-x-65 ~ $ k describe deploy -n goldilocks | grep Events
Events: <none>
Events: <none>
Also, can you post what you get from kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io linkerd-proxy-injector-webhook-config -oyaml?
Yes of course!
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    meta.helm.sh/release-name: linkerd-control-plane
    meta.helm.sh/release-namespace: linkerd
  labels:
    app.kubernetes.io/managed-by: Helm
    linkerd.io/control-plane-component: proxy-injector
    linkerd.io/control-plane-ns: linkerd
  name: linkerd-proxy-injector-webhook-config
webhooks:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    caBundle: $CABUNDLE
    service:
      name: linkerd-proxy-injector
      namespace: linkerd
      path: /
      port: 443
  failurePolicy: Ignore
  matchPolicy: Equivalent
  name: linkerd-proxy-injector.linkerd.io
  namespaceSelector:
    matchExpressions:
    - key: config.linkerd.io/admission-webhooks
      operator: NotIn
      values:
      - disabled
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
      - cert-manager
  objectSelector:
    matchExpressions:
    - key: linkerd.io/control-plane-component
      operator: DoesNotExist
    - key: linkerd.io/cni-resource
      operator: DoesNotExist
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    - services
    scope: Namespaced
  sideEffects: None
  timeoutSeconds: 10
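Note that with failurePolicy: Ignore, a webhook call that the API server cannot complete is skipped silently and the pod comes up uninjected, so one quick check is whether the webhook service at least has ready endpoints, e.g.:
kubectl get endpoints linkerd-proxy-injector -n linkerd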
Any idea how I should continue? Thank you very much in advance!
Any clue? Thank you very much in advance!
Hi @gabbler97! Based on the output from linkerd check, it seems that your control plane is not healthy. Looking more closely at the control plane logs, I do see a lot of failures from the control plane components to connect to each other. I'd suggest using Cilium's observability tools (such as Hubble) to ensure that Cilium is allowing traffic between the control plane components.
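A quick way to look for dropped flows involving the Linkerd control plane (a sketch; assumes the Hubble relay is reachable, e.g. via cilium hubble port-forward, and the hubble CLI is installed) might be:
cilium hubble port-forward &
hubble observe --namespace linkerd --verdict DROPPED
hubble observe --namespace linkerd --to-port 8086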
FWIW, I've successfully tested Linkerd with Cilium chained in hybrid mode with the AWS VPC CNI, and it worked fine. Looking forward to what you find out about the control plane connectivity issues.
Thank you very much for your help! In the meantime I have found another way to avoid IPv4 exhaustion. If somebody needs it in the future, it can be found here: https://aws.github.io/aws-eks-best-practices/networking/custom-networking/