CockroachDB cannot be meshed (Cluster-communication breaks)
What is the issue?
Hello :)
I deploy linkerd 2.13.5 in HA mode:
linkerd install --crds ...
linkerd install --ha ...
linkerd viz install --ha ...
Then I set the namespace's default inbound policy to "all-authenticated":
kubectl annotate namespace development config.linkerd.io/default-inbound-policy=all-authenticated
Then I deploy a CockroachDB cluster via the Helm chart with default values:
helm upgrade --install cockroachdb cockroachdb/cockroachdb --version 11.1.3 --namespace development
The CockroachDB cluster works fine afterwards.
Then I try to perform linkerd injection:
kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -
The rollout process gets stuck because the first restarted pod does not become ready, so I manually restart the other pods.
But even after all pods have been restarted and contain the linkerd init and sidecar containers, the CockroachDB cluster does not work anymore; the nodes cannot reach each other:
E230801 17:25:57.509642 927 2@rpc/context.go:2404 ⋮ [T1,n1,rnode=2,raddr=‹cockroachdb-2.cockroachdb.development.svc.cluster.local:26257›,class=default,rpc] 108 unable to connect (is the peer up and reachable?): initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: authentication handshake failed: EOF"› [code 14/Unavailable]
The linkerd-proxy sidecar container does not log anything related to port 26257. (It only logs unauthorized connection attempts from Prometheus to port 8080, which is correct, but unrelated to the CockroachDB cluster-communication issue.)
Also
linkerd viz tap -n development sts/cockroachdb
does not show anything related to port 26257.
I don't know how to further debug this issue.
I have tried to set the annotation
config.linkerd.io/opaque-ports: 26257,8080
but this did not change anything.
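(For completeness: the proxy reads that annotation from the pod metadata, so it has to land on the pod template rather than only on the StatefulSet object or the namespace. A sketch of one way to apply it, not taken from this report:)
kubectl -n development patch sts cockroachdb -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/opaque-ports":"26257,8080"}}}}}'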
The k8s cluster is an EKS cluster, version 1.26.6.
Can anybody give me a hint on how to further debug this issue?
Thanks in advance :)
How can it be reproduced?
Set the namespace's default inbound policy to "all-authenticated":
kubectl annotate namespace development config.linkerd.io/default-inbound-policy=all-authenticated
Deploy a CockroachDB cluster via the Helm chart with default values:
helm upgrade --install cockroachdb cockroachdb/cockroachdb --version 11.1.3 --namespace development
Perform linkerd injection:
kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -
As the rollout process gets stuck, delete the remaining two pods so that all three pods run with the linkerd sidecar.
The CockroachDB pods remain unhealthy and end up in a crash loop, logging "unable to connect (is the peer up and reachable?)".
Logs, error output, etc
No error logs from linkerd as far as I can see. No logs regarding blocked packets related to cluster port 26257.
output of linkerd check -o short
$ linkerd check -o short
Status check results are √
Environment
AWS EKS: v1.26.7-eks-2d98532
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
If memory serves, we've seen issues like this in the past with CockroachDB due to its use of an initContainer which must communicate over the network as part of the startup process. Because initContainers run before the Linkerd proxy, they are unmeshed, and an all-authenticated policy will deny traffic from those init containers, because their traffic cannot be authenticated (as there is no Linkerd proxy performing mTLS on their behalf yet).
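If someone wanted to carve out an exception for that situation, a rough sketch would be a Server plus a ServerAuthorization admitting unauthenticated clients on the intra-node port. The pod selector labels below are an assumption about what the CockroachDB chart sets, so adjust them to the actual pod labels:

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  namespace: development
  name: cockroachdb-grpc
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: cockroachdb   # assumption: label applied by the chart
  port: 26257
  proxyProtocol: opaque
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: development
  name: cockroachdb-grpc-unauthed
spec:
  server:
    name: cockroachdb-grpc
  client:
    unauthenticated: true

Note that this trades away authentication on that port, so it's more of a diagnostic step than a fix.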
I believe that the native sidecar containers entering beta in Kubernetes 1.29 may resolve issues like this, since they could be used to allow Linkerd proxies to start up before other initContainers run, allowing initContainer traffic to be meshed. Linkerd added support for native sidecar containers in today's edge release, edge-23.11.4 (see PR #11465), so this issue may be fixed for edge-23.11.4 running on Kubernetes 1.28. It's also possible that additional steps are necessary in order to use this new, beta functionality in Kubernetes to resolve this issue --- perhaps @alpeb knows more about this?
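If anyone wants to experiment with that, my recollection is that the edge release gates native sidecar injection behind an alpha annotation on the workload's pod template. The annotation name below is from memory and may not be exact, so please check the edge-23.11.4 release notes:
kubectl -n development patch sts cockroachdb -p '{"spec":{"template":{"metadata":{"annotations":{"config.alpha.linkerd.io/proxy-enable-native-sidecar":"true"}}}}}'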
I'm not very familiar with the past cockroachDB issues, but by inspecting this example I see the sts' init container is just a shell command that doesn't hit the network. OTOH when sts pods are rolled out, they need to keep on connecting to one another in order to become ready: e.g. cockroachdb-0 is the first to be rolled out and gets injected while cockroachdb-1 and cockroachdb-2 remain uninjected, but cockroachdb-0 can't receive connections from the others because of policy, so it doesn't fully start and the sts rollout process gets stuck.
What you need to do here is add the all-authenticated policy to the namespace only after the sts has been injected.
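Concretely, the ordering could look something like this (sketch only; the trailing dash on the first command is standard kubectl syntax for removing an annotation):
kubectl annotate namespace development config.linkerd.io/default-inbound-policy-
kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -
kubectl -n development rollout status sts/cockroachdb
kubectl annotate namespace development config.linkerd.io/default-inbound-policy=all-authenticated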
Thank you very much for the reply :)
@alpeb I was already aware of this issue and manually deleted the two remaining pods so that all three run with the linkerd sidecar container. They still cannot talk to each other. Also, NO policy violation is logged. Please re-read my original description for more details.
I still believe this is an actual bug. If you want me to re-test with latest version of linkerd or do some more analysis, just tell me what to do :)
I have also gained some experience with issues related to other init containers that are created via webhook injection. I see such an issue with HashiCorp's Vault Agent Injector webhook in combination with Linkerd init- and sidecar-container injection. But with CockroachDB, the issue is not about init containers.
With CockroachDB, the main containers simply cannot talk to each other anymore, even if all of them are meshed.
I think this is caused by cockroach's delicate consensus mechanism. From my testing, this works as long as you issue node drain on a pod before bouncing it so it becomes injected. This has to be done one by one.
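Roughly, for each pod in turn (sketch only; the binary path and TLS flags are placeholders that depend on how the chart is configured, e.g. use --insecure instead of --certs-dir if TLS is off):
kubectl -n development exec cockroachdb-0 -- /cockroach/cockroach node drain --certs-dir=/cockroach/cockroach-certs --host=cockroachdb-0.cockroachdb
kubectl -n development delete pod cockroachdb-0
kubectl -n development wait --for=condition=Ready pod/cockroachdb-0 --timeout=10m
Then repeat for cockroachdb-1 and cockroachdb-2 once the previous pod is Ready again.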
@alpeb
Do you maybe have a log or script that I can use to reproduce what you did to get it working?
Have you tried to reproduce what I did by following the steps in my original description?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.