Connection refused randomly for pairs of pods

Open zack-littke-smith-ai opened this issue 8 months ago • 4 comments

What is the issue?

I am running into a hard-to-reproduce issue where one of our k8s pods stops serving certain clients. The client proxy logs:

WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

And:

INFO ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}:rescue{client.addr=172.28.187.94:55562}: linkerd_app_core::errors::respond: gRPC request failed error=logical service service-name.namespace.svc.cluster.local:10079: service unavailable error.sources=[service unavailable]

However, during this time the service still connects to other clients and serves their requests, so the failure is specific to certain client/service pairs. Restarting the clients has no effect, and restarting the service only sometimes helps: it reconnects to some clients but still fails to reconnect to others.

The only 'solution' that has worked for us is restarting every single linkerd container and every service running a proxy, which is not ideal to say the least.

While I have no solid repro, I'm hoping to at least take away some debugging tips for the next time this happens to us.
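
One thing that may help narrow this down next time is tallying which client/destination pairs show up in the client proxy's failure lines. Below is a minimal sketch, assuming the log lines look like the "service unavailable" sample quoted above; the regular expression is inferred from that single sample rather than from any documented log schema, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line in this issue (an assumption,
# not a documented linkerd2-proxy log format).
RESCUE_RE = re.compile(
    r"outbound:proxy\{addr=(?P<dst>[^}]+)\}:rescue\{client\.addr=(?P<client>[^}]+)\}"
)


def failing_pairs(lines):
    """Count (client IP, destination address) pairs seen in rescue log lines."""
    counts = Counter()
    for line in lines:
        match = RESCUE_RE.search(line)
        if match:
            # Drop the ephemeral source port so repeated failures from the
            # same client are grouped together.
            client_ip = match.group("client").rsplit(":", 1)[0]
            counts[(client_ip, match.group("dst"))] += 1
    return counts
```

Feeding it the saved output of the client's linkerd-proxy container, for example `failing_pairs(open("client-proxy.log"))` with a hypothetical file name, gives a count per pair, which makes it easier to see whether the refusals cluster on particular destination addresses.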

How can it be reproduced?

Unfortunately, I have not been able to reliably reproduce it in our own environments.

Logs, error output, etc

Proxy logs from the service:

[ 0.001766s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002498s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003107s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003116s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003118s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003121s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003122s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003124s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003126s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.019669s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.001800s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002498s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003148s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003164s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003166s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003168s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003171s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003173s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003175s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.012067s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local

Logs from the client proxy are included above.

output of linkerd check -o short

---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unsupported version channel: stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-6954bdcf79-6p7z5 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-destination-6954bdcf79-df9f2 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-destination-6954bdcf79-jnncs (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-gc2qp (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-ph8v8 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-identity-5958cdbd64-qsh5m (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-77vl9 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-khhfp (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* linkerd-proxy-injector-7664c7cf84-xzz9x (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-6954bdcf79-6p7z5 running 3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* grafana-6c4c8b997d-ptswf (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* metrics-api-7d685f8896-f4d52 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* prometheus-dd8b5b7f4-2rsgn (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* tap-59769cd568-7t92z (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* tap-injector-6f987fddf9-f9fs5 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
	* web-7c6ff5b7d-7tdb6 (3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-6c4c8b997d-ptswf running 3cd7d7a0849f124af2156783ae1989d0a1248d412341cd97f781e60feae98dda but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

linkerd_controller: stable-2.14.1
linkerd_debug: stable-2.14.1
linkerd_grafana: stable-2.11.1
linkerd_metrics_api: stable-2.14.1
linkerd_policy_controller: stable-2.14.1
linkerd_proxy: stable-2.14.1
linkerd_proxy_init: v2.2.3
linkerd_tap: stable-2.14.1
linkerd_web: stable-2.14.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

zack-littke-smith-ai, Jun 04 '24 23:06