Sporadic HTTP 403 Response Codes Due to TLS Handshake Failures
What is the issue?
We are encountering sporadic HTTP 403 response codes across all our clusters in the Linkerd-proxy setup. The attached logs capture the conversation flow along the path gojira-pod's Linkerd-proxy sidecar <-> gojira-pod's Linkerd-debug sidecar <-> mosura-pod's Linkerd-debug sidecar <-> mosura-pod's Linkerd-proxy sidecar, illustrating one such occurrence. Roughly 5 in 1,000 requests (0.5%) fail due to unsuccessful TLS handshakes, resulting in HTTP 403 response codes.
How can it be reproduced?
To reproduce the issue, you can generate approximately 1000 requests to a linkerd-proxy setup. A small percentage of these requests will fail, with the log message "Peer does not support TLS" occurring during the TLS handshake. The client and target are situated in different namespaces, with the client being gojira-pod and the server being mosura-pod, where mosura serves the kani-api. This issue can be replicated using any client communicating with any server that provides a REST API.
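For illustration, a request generator along these lines can be used; the deployment name, service hostname, and request path are placeholders (only port 8080 matches the Server config in the additional context), and it assumes curl and seq are available in the client container:
kubectl -n gojira-land exec deploy/gojira -- sh -c '
  fails=0
  for i in $(seq 1 1000); do
    # send one request through the outbound linkerd-proxy and record the status code
    code=$(curl -s -o /dev/null -w "%{http_code}" http://kani-api.mosura-land.svc.cluster.local:8080/)
    [ "$code" = "403" ] && fails=$((fails+1))
  done
  echo "403 responses: $fails / 1000"'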
For detailed configuration, please refer to the "Additional context" section, where the configurations of authorizationpolicies.v1alpha1.policy.linkerd.io, servers.v1beta3.policy.linkerd.io, and meshtlsauthentications.v1alpha1.policy.linkerd.io can be reviewed.
Logs, error output, etc
gojira-debug.log.csv gojira-proxy.log.csv mosura-debug.log.csv mosura-proxy.log.csv
output of linkerd check -o short
kurisu@linkerd-worries$ linkerd check -n linkerd --linkerd-namespace linkerd --cni-namespace linkerd -o short
linkerd-identity
----------------
‼ trust anchors are valid for at least 60 days
Anchors expiring soon:
* 4116 BlueOysterCA will expire on 2025-05-23T16:53:17Z
see https://linkerd.io/2/checks/#l5d-identity-trustAnchors-not-expiring-soon for hints
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2025-05-01T08:56:18Z
see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
Anchors expiring soon:
* 4116 BlueOysterCA will expire on 2025-05-23T16:53:17Z
* 558509666953049899959298119593666570293338784147 root.linkerd.cluster.local will expire on 2025-04-13T15:29:32Z
see https://linkerd.io/2/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
Anchors expiring soon:
* 4116 BlueOysterCA will expire on 2025-05-23T16:53:17Z
* 558509666953049899959298119593666570293338784147 root.linkerd.cluster.local will expire on 2025-04-13T15:29:32Z
see https://linkerd.io/2/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
Anchors expiring soon:
* 4116 BlueOysterCA will expire on 2025-05-23T16:53:17Z
* 558509666953049899959298119593666570293338784147 root.linkerd.cluster.local will expire on 2025-04-13T15:29:32Z
see https://linkerd.io/2/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 25.2.3 but the latest edge version is 25.3.3
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 25.1.2 but the latest edge version is 25.3.3
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-25.1.2 but cli running edge-25.2.3
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-5c7f864fcc-2zqp6 (edge-25.1.2)
* linkerd-destination-5c7f864fcc-k2k7t (edge-25.1.2)
* linkerd-identity-7554d6fdd-pjhvd (edge-25.1.2)
* linkerd-identity-7554d6fdd-qlwph (edge-25.1.2)
* linkerd-proxy-injector-6d47f9cc8-jht52 (edge-25.1.2)
* linkerd-proxy-injector-6d47f9cc8-splcs (edge-25.1.2)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-5c7f864fcc-2zqp6 running edge-25.1.2 but cli running edge-25.2.3
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
Status check results are √
Environment
kurisu@linkerd-worries$ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.30.7
Possible solution
No response
Additional context
kurisu@linkerd-worries$ kubectl get -n mosura-land authorizationpolicies.v1alpha1.policy.linkerd.io
NAME AGE
kani-auth-gojira-land 16d
kani-auth-mosura-land 16d
kurisu@linkerd-worries$ kubectl get -n gojira-land authorizationpolicies.v1alpha1.policy.linkerd.io
No resources found in gojira-land namespace.
kurisu@linkerd-worries$ linkerd authz -n mosura-land deploy/kani-api
ROUTE SERVER AUTHORIZATION_POLICY SERVER_AUTHORIZATION
* kani-api-server kani-auth-gojira-land
* kani-api-server kani-auth-mosura-land
kurisu@linkerd-worries$ kubectl describe servers.v1beta3.policy.linkerd.io -n mosura-land kani-api-server
Name: kani-api-server
Namespace: mosura-land
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: kani-api
meta.helm.sh/release-namespace: mosura-land
API Version: policy.linkerd.io/v1beta3
Kind: Server
Metadata:
Creation Timestamp: <redacted>
Generation: <redacted>
Resource Version: <redacted>
UID: <redacted>
Spec:
Access Policy: deny
Pod Selector:
Match Labels:
App: kani-api
Port: 8080
Proxy Protocol: unknown
Events: <none>
kurisu@linkerd-worries$ kubectl describe authorizationpolicies.v1alpha1.policy.linkerd.io -n mosura-land kani-auth-gojira-land
Name: kani-auth-gojira-land
Namespace: mosura-land
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: kani-api
meta.helm.sh/release-namespace: mosura-land
API Version: policy.linkerd.io/v1alpha1
Kind: AuthorizationPolicy
Metadata:
Creation Timestamp: <redacted>
Generation: <redacted>
Resource Version: <redacted>
UID: <redacted>
Spec:
Required Authentication Refs:
Group: policy.linkerd.io
Kind: MeshTLSAuthentication
Name: kani-gojira-meshauth
Target Ref:
Group: policy.linkerd.io
Kind: Server
Name: kani-api-server
Events: <none>
kurisu@linkerd-worries$ kubectl describe authorizationpolicies.v1alpha1.policy.linkerd.io -n mosura-land kani-auth-mosura-land
Name: kani-auth-mosura-land
Namespace: mosura-land
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: kani-api
meta.helm.sh/release-namespace: mosura-land
API Version: policy.linkerd.io/v1alpha1
Kind: AuthorizationPolicy
Metadata:
Creation Timestamp: <redacted>
Generation: <redacted>
Resource Version: <redacted>
UID: <redacted>
Spec:
Required Authentication Refs:
Group: policy.linkerd.io
Kind: MeshTLSAuthentication
Name: kani-mosura-meshauth
Target Ref:
Group: policy.linkerd.io
Kind: Server
Name: kani-api-server
Events: <none>
kurisu@linkerd-worries$ kubectl describe -n mosura-land meshtlsauthentications.v1alpha1.policy.linkerd.io kani-gojira-meshauth
Name: kani-gojira-meshauth
Namespace: mosura-land
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: kani-api
meta.helm.sh/release-namespace: mosura-land
API Version: policy.linkerd.io/v1alpha1
Kind: MeshTLSAuthentication
Metadata:
Creation Timestamp: <redacted>
Generation: <redacted>
Resource Version: <redacted>
UID: <redacted>
Spec:
Identity Refs:
Kind: ServiceAccount
Name: gojira-proxy
Namespace: gojira-land
Events: <none>
kurisu@linkerd-worries$ kubectl describe -n mosura-land meshtlsauthentications.v1alpha1.policy.linkerd.io kani-mosura-meshauth
Name: kani-mosura-meshauth
Namespace: mosura-land
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: kani-api
meta.helm.sh/release-namespace: mosura-land
API Version: policy.linkerd.io/v1alpha1
Kind: MeshTLSAuthentication
Metadata:
Creation Timestamp: <redacted>
Generation: <redacted>
Resource Version: <redacted>
UID: <redacted>
Spec:
Identity Refs:
Kind: ServiceAccount
Name: default
Namespace: mosura-land
Events: <none>
Would you like to work on fixing this bug?
None
Related issues: https://github.com/linkerd/linkerd2/issues/13013 and https://github.com/linkerd/linkerd2/issues/6548
I'd like to share an update based on some recent testing I conducted:
I ran a series of experiments using various combinations of Linkerd versions (edge-24.8.2, edge-25.1.2, and edge-25.4.2), Gateway API configurations (as provided by the Linkerd installation), and NGINX Ingress Controller versions (4.11.2, 4.12.0, and 4.12.1) across multiple kind clusters running Kubernetes (versions 1.30.7 and 1.31.8). In each scenario, I simulated traffic by sending a substantial number of requests (two parallel consumers, each issuing 1000 sequential requests) through the ingress to a backend pod within a Linkerd service mesh.
In none of these test cases was I able to reproduce the 403 Forbidden errors.
That said, since we continue to observe sporadic 403 responses in our production environment - and considering that @cratelyn kindly removed the wontfix label - I assume there may be some level of ongoing investigation or interest in this issue.
Would you recommend any interim mitigation strategies to help reduce the frequency or impact of these errors while the root cause is still being explored?
Hi @soma-kurisu thanks for all the work you've already done investigating this. We've tried to track down these types of sporadic TLS issues in the past but they've been hard to nail down without a consistent way to reproduce them. In some cases, they can happen when a node is CPU constrained, so the overall health of your node might be something to look at.
I'll also point out that Linkerd's TLS handling has changed significantly between edge-25.2.1 (where you reported the issue) and now (e.g. edge-25.6.4). I would recommend upgrading to a more recent edge release and seeing if the TLS errors occur there.
Thanks for the input, @adleong - much appreciated.
We'll take a closer look at potential CPU constraints on the nodes and will also try upgrading to a more recent edge release like edge-25.6.4 to see if the issue persists.
I'll also point out that Linkerd's TLS handling has changed significantly between edge-25.2.1 (where you reported the issue) and now (e.g. edge-25.6.4). I would recommend upgrading to a more recent edge release and seeing if the TLS errors occur there.
Do you happen to know if there is a specific version where the change (or the bulk of the changes) happened? We have since upgraded to edge-25.4.2 but are still seeing the issues.
@adleong some updates on this while @soma-kurisu is away:
We've encountered the issue again this week during a production deployment. At the time, we were running edge-25.4.2. In response to the incident, we upgraded our clusters to edge-25.6.2, as any newer version is affected by #14298, which is also an issue for our workloads. Also, at the time of occurrence, the nodes were not CPU constrained at all.
Next Monday there will be another attempt at the production deployment that caused the most recent incident. We'll observe and let you know if we have finally found a way to reliably reproduce the issue or if the upgrade seems to have fixed things.
Unfortunately the issue occurred despite the upgrade to 25.6.2. This time we were lucky enough to be able to get a look at the in-progress issue and could gather some more information.
- We observed the behavior in a workload with 6 replicas on the receiving side.
- All replicas received load balanced traffic, but only one of the replicas had the 403-issue.
- All replicas were on separate nodes.
- While the issue happened, we didn't observe any kind of CPU, memory, or IO pressure on the node, pod, or linkerd-proxy container.
- No other pod on the same node, or any other node on the cluster, generated 403 messages while the incident was in progress
- Restarting the failing replica fixed the issue
@adleong we have a major update on the behavior observed. During the past week, the issue has occurred in production after two different application deployments. In one case, the deployment was rolled back; in the other case, the affected pod (1/6) was restarted. What is noteworthy is that once the issue occurs, it persists for some time until it potentially goes away by itself.
Since we cannot reliably reproduce the start of the problem, troubleshooting has been quite difficult as all we had to work with were logs and never the pods showing the odd behavior. Yesterday, I was working on configuring linkerd-viz in a sandbox environment with no external load at all when the problem randomly occurred on the pod I had the log stream open on (talk about luck). Here's the log file: demotenant-api-application-apiapp-deployment-7695c8bcc8-b7hcc-1754322047371503600.log
Observations from the debug session
- The pod from which the logs are extracted was restarted as part of a scaling operation and immediately started to show the 403 with NoClientHello behavior (the log file starts at 44 seconds; the issue was there before). The IP 10.207.204.11 is prometheus trying to scrape the metrics from the admin API.
- Prometheus continued to re-try the calls every few seconds, every time with the same failure.
- At 143s, I tried to call the pod from a test pod in another namespace which did not have any AuthorizationPolicy to allow that connection. As expected, this resulted in a 403, but the main difference is that in this case the client is identified.
- Between 184s and 194s, I restarted the prometheus pod, which resulted in the IP changing to 10.207.204.44. From that point on, the prometheus connections also fail with 403 but now with identity, which is expected as there was no AuthorizationPolicy in place yet.
- While this was happening, prometheus was also trying to scrape other pods where it did not encounter the NoClientHello problem.
What we know so far
- The issue appears randomly, most likely after workload deployments
- It only affects a single pairing of src / dst - the client is able to establish other connections and the server is able to receive other connections
- Restarting either the client or the service pod will bring things back to normal
- The issue tends to go away by itself in the evening (when we have much lower load on the system)
What we believe is happening
- Service discovery fails for a connection due to an unknown reason
- The client proxy caches this discovery result for 5 seconds due to the default setting of outboundDiscoveryCacheUnusedTimeout
- All subsequent calls to the same destination from this client will fail unless there is a period of time where no call happens for at least 5 seconds.
This would explain why only a single pairing of src / dst is affected. It also explains why restarting either of the two pods solves the problem (in one case the cache is destroyed, in the other the destination changes and the cache is not accessed anymore) and why the problem tends to go away when load decreases (fewer calls = more likely to go 5s without an outbound call).
Conclusion
We still do not know the reason for the initial discovery failure. However, it is clearly a transient fault of sorts.
I see two possible solutions for this problem (without being familiar with the codebase):
- Adapt the caching logic to not cache discovery results where the TLS handshake failed
- Implement a second cache timeout (e.g. outboundDiscoveryCacheTimeout) that is not specific to unused cache items. This way, regular re-discovery could be forced to take place, making sure transient discovery faults do not turn into persistent ones.
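For reference, a minimal sketch of where the existing timeout mentioned above is surfaced - a pod-template annotation; the deployment name is illustrative and 5s is already the default:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gojira                # illustrative client workload
spec:
  template:
    metadata:
      annotations:
        # bounds how long an *unused* outbound discovery result is kept; it does not
        # force re-discovery while traffic keeps refreshing the cache entry
        config.linkerd.io/proxy-outbound-discovery-cache-unused-timeout: 5s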
Thanks for the detailed report @bkittinger . Quick question just to rule out something similar to #14103, which was fixed on edge-25.6.3: are you using native sidecars?
@alpeb We are using native sidecars.
However, the issue here does not lie within the policy engine - it is doing what it is supposed to. As @soma-kurisu pointed out in his initial post, the problem is a failed TLS handshake between the proxies, resulting in a fallback to plain HTTP communications and then a block by the inbound policy as the client is unauthenticated (since there is no certificate). And because the discovery result is cached, a single failed handshake can cause a persisting issue.
It seems we may need a bit more information to form a hypothesis about the cause. Could you help clarify why a fallback to plain text might occur if the TLS handshake fails? That is something that the proxy on the inbound side wouldn't do. In the reproduction steps you provided, it appears that Prometheus, as the client, is encountering the issue. Have you set up different scrape configurations using both HTTPS and HTTP schemes?
The inbound side blocks incoming requests since they don't use TLS, but we suspect that the issue might start on the client side, either due to a failed handshake or because of service discovery issues. So basically: client calls server -> something happens -> client thinks the server doesn't support TLS and falls back to HTTP -> server refuses the connection because it doesn't use TLS.
Why this initial issue is happening is exactly what we are trying to find out. One of the triggers that we can observe pretty consistently is that the fallback happens if the proxy receives a request within the first ~2-3 seconds after the container was started. This might be why we see the prometheus that ships with viz being a strong trigger for the behavior, as it doesn't wait for the pod to be ready before sending scraping requests to the proxy. Note that the fallback is route-specific: if prometheus runs into the issue, the proxy accepts all other requests just fine, and vice versa.
It's not limited to prometheus, though. We are able to trigger it with regular requests too. I attached a helm chart we are using to try to pin down the issue. The setup is that we have a server pod running a simple webserver behind a service on port 8080, and a helper pod that does a rolling restart of the webserver every few seconds to have its linkerd-proxy container constantly restart. We then have two types of client constantly curling the server.
- A regular client that tries to call the server in a safe way by going over its service and waiting until the pod is ready
- A "direct" client that tries to brute-force the server by ignoring the service and directly calling the server pods via their IPs, regardless of whether the pod is ready (a rough sketch of this client follows below)
In addition to that there is some network plumbing like linkerd authorization policies, service accounts, role bindings, etc. The namespace has 'config.linkerd.io/default-inbound-policy: deny' and 'linkerd.io/inject: enabled' annotations set. It also has policies to whitelist the prometheus service account, but those are deployed separately, so they are not part of this helm chart.
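For completeness, a namespace sketch with the two annotations just mentioned (the name is taken from the metrics query below; the prometheus allow-policies are deployed separately and omitted):
apiVersion: v1
kind: Namespace
metadata:
  name: demotenant-linkerd3
  annotations:
    config.linkerd.io/default-inbound-policy: deny
    linkerd.io/inject: enabled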
We then use the viz prometheus to scrape the metrics of the client pods and get the following result after a few hours
sum(outbound_http_route_request_statuses_total{namespace="demotenant-linkerd3"}) by(namespace,component,http_status,error)
Ignoring the 3xx and 5xx responses that are a result of our brute-force approach, we can see that the regular clients, which wait for the pod to be completely ready, have a success rate of close to 100%, while the clients making requests as soon as the server pod starts up receive noclienthello/403 responses about 10% of the time.
In this scenario we used 3 servers, 3 brute-force clients and 15 regular clients. So even though the regular clients outnumber the brute-force ones 5 to 1 they are working absolutely fine, which seems to indicate that it's not a question of request volume overwhelming the proxy, but a matter of when and how requests are happening.
With this we can say under what circumstances the issue occurs, but not why. Also, some of our users report that the issue started for them with already running pods, so there might be additional triggers, but since we don't know what the root cause is we don't have a reproduction scenario for that.
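To zoom in on just the 403s and the error the proxy attributes to them, a variation of the query above can be used (the 1h window is arbitrary and the exact label values are assumptions):
sum by (component, error) (
  increase(outbound_http_route_request_statuses_total{namespace="demotenant-linkerd3", http_status="403"}[1h])
)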
I've tried the repro and could see the sporadic 403 errors pop up.
The problem is that the server pod won't get admitted into the mesh until it's in the Running state. However, it will be discoverable by proxies on clients hitting the pod IP directly, and those clients won't attempt to establish an mTLS connection until the target pod gets into the Running state. Proxies on clients should receive a discovery update as soon as the target becomes Runnable. This explains the behavior in the repro, but doesn't explain the other scenarios you describe for already running pods or for the prometheus client that remains in a bad state until it gets restarted.
For those scenarios there might be staleness in the destination controller that is preventing it from sending proper updates about the target pod. You can monitor the lag on the discovery updates by following the endpointslices_informer_lag_seconds metric in the destination controller. It'd be great if you could correlate spikes in that against the appearance of the 403s.
Also, the proxy logs in the client, at debug level, will tell when the target's discovery didn't provide TLS information (linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery) and then when that discovery information gets updated to support TLS (linkerd_tls::client: alpn=transport.l5d.io/v1).
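A quick sketch of both checks; the deployment name and namespace in the second command are placeholders, and since the log-level annotation lives on the pod template, the workload has to be restarted for it to take effect:
# informer lag reported by the destination controller
linkerd diagnostics controller-metrics | grep endpointslices_informer_lag_seconds

# enable proxy debug logging on a client workload
kubectl -n gojira-land patch deploy/gojira --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug"}}}}}'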
Awesome, happy to hear that the reproduction scenario is not specific to just our environment. I'll see if we can set up some monitoring to correlate the issue with the endpointslices_informer_lag_seconds metrics. But we need to roll this out to all stages until prod, so it might take a couple of days.
In regards to prometheus - isn't that the same (or very similar) issue? In the reproduction helm chart I'm calling the server as soon as it's available, regardless of whether the pod is ready or not. With this I tried to mimic the behavior of prometheus because it also starts scraping the proxies as soon as they start up.
The issue that prometheus has is also route specific. If it's trying to scrape 10 pods, and 3 of those 10 pods fail to do a handshake it won't be able to scrape these pods because the failed handshake will be cached, but the other 7 pods can be scraped just fine. You also don't need to restart prometheus to resolve it, you can just kill the individual client pods having the issue.
In regards to prometheus - isn't that the same (or very similar) issue? In the reproduction helm chart I'm calling the server as soon as it's available, regardless of whether the pod is ready or not. With this I tried to mimic the behavior of prometheus because it also starts scraping the proxies as soon as they start up.
In the repro helm chart the client faces a 403, but just sporadically for one request, and then the following requests succeed. This is different from the issue you mentioned before:
Prometheus continued to re-try the calls every few seconds, every time with the same failure
Yeah, I haven't figured out how to reproduce that yet. When the issue occurs in prod we see that subsequent calls from the same route keep failing, similar to how prometheus behaves. That said, I don't really understand why it keeps failing for prometheus. I set up the following annotations to reduce the cache timeout on both the linkerd-viz and client namespaces:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    config.linkerd.io/default-inbound-policy: deny
    config.linkerd.io/proxy-inbound-discovery-cache-unused-timeout: 5s
    config.linkerd.io/proxy-outbound-discovery-cache-unused-timeout: 5s
    linkerd.io/inject: enabled
After that I increased the prometheus scrape interval and timeout to 15 seconds:
apiVersion: v1
data:
  prometheus.yml: |-
    global:
      evaluation_interval: 10s
      scrape_interval: 15s
      scrape_timeout: 15s
    rule_files:
      - /etc/prometheus/*_rules.yml
      - /etc/prometheus/*_rules.yaml
To my understanding, this should fix the issues that prometheus has when scraping pods, because even if the initial scrape fails while a pod hasn't been admitted to the mesh yet, subsequent scrapes should work fine since the scrape interval is triple the proxy's cache timeout. But after restarting all client pods and waiting for a few minutes, the failing scrapes seem to be sticky:
2025-08-19T12:17:37.336391253Z [ 0.001047s] INFO ThreadId(01) linkerd2_proxy: release 2.300.0 (67dc85a) by linkerd on 2025-06-02T22:14:32Z
2025-08-19T12:17:37.338071500Z [ 0.002756s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
2025-08-19T12:17:37.338693408Z [ 0.003382s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
2025-08-19T12:17:37.338702761Z [ 0.003391s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
2025-08-19T12:17:37.338704944Z [ 0.003393s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
2025-08-19T12:17:37.338707257Z [ 0.003394s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
2025-08-19T12:17:37.338709741Z [ 0.003395s] INFO ThreadId(01) linkerd2_proxy: SNI is default.demotenant-linkerd3.serviceaccount.identity.linkerd.cluster.local
2025-08-19T12:17:37.338712775Z [ 0.003396s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.demotenant-linkerd3.serviceaccount.identity.linkerd.cluster.local
2025-08-19T12:17:37.338715016Z [ 0.003398s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
2025-08-19T12:17:38.723704556Z [ 1.388232s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.202.6:44004}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=no route found for request
2025-08-19T12:17:39.723837101Z [ 2.388485s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.202.6:44016}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=no route found for request
2025-08-19T12:17:40.339387300Z [ 3.004022s] WARN ThreadId(01) watch{port=4191}: linkerd_app_inbound::policy::api: Unexpected policy controller response; retrying with a backoff grpc.status=Deadline expired before operation could complete grpc.message="initial item not received within timeout"
2025-08-19T12:17:40.717124817Z [ 3.381773s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.203.107:46386}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=no route found for request
2025-08-19T12:17:40.723405658Z [ 3.388060s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.202.6:44026}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=no route found for request
2025-08-19T12:17:41.723742031Z [ 4.388400s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.202.6:44040}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=no route found for request
[waiting a few minutes]
2025-08-19T12:20:10.715889642Z [ 153.380508s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::http: Request denied server.group=policy.linkerd.io server.kind=server server.name=linkerd-proxy-admin route.group=policy.linkerd.io route.kind=HTTPRoute route.name=linkerd-proxy-metrics client.tls=None(NoClientHello) client.ip=10.207.203.107
2025-08-19T12:20:10.715917770Z [ 153.380537s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.203.107:38630}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=unauthorized request on route
2025-08-19T12:20:25.716431310Z [ 168.381078s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::http: Request denied server.group=policy.linkerd.io server.kind=server server.name=linkerd-proxy-admin route.group=policy.linkerd.io route.kind=HTTPRoute route.name=linkerd-proxy-metrics client.tls=None(NoClientHello) client.ip=10.207.203.107
2025-08-19T12:20:25.716454084Z [ 168.381106s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.203.107:48738}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=unauthorized request on route
2025-08-19T12:20:40.727765479Z [ 183.392329s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::http: Request denied server.group=policy.linkerd.io server.kind=server server.name=linkerd-proxy-admin route.group=policy.linkerd.io route.kind=HTTPRoute route.name=linkerd-proxy-metrics client.tls=None(NoClientHello) client.ip=10.207.203.107
2025-08-19T12:20:40.727816414Z [ 183.392399s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.207.203.107:55414}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=unauthorized request on route
At that point the client pod has been continuously (successfully) sending requests to other meshed pods, so it's definitely meshed.
Can you please post the logs for the proxy in the prometheus pod, at debug level (using the annotation config.linkerd.io/proxy-log-level: linkerd=debug)?
Here you go. That gist contains debug logs of both prometheus and one of the clients it was unable to scrape (linkerd-403-looper-client-777bfd6f7b-zwffq/10.207.204.65).
I managed to consistently reproduce the issue with a default Prometheus client and a simple server like the one provided in your repro with a default-deny policy. Bouncing the server pod enough times that the scraping occurs during the server's initialization stage, prior to reaching the Running state, puts the Prometheus client in the failure loop. Apparently, Prometheus uses persistent connections even for targets returning 403. This is the HTTP config it uses, which has keepalive enabled by default. I couldn't find a way to change those defaults in the Prometheus config.
Given the persistence of the connection, the proxy doesn't set it up for mTLS even when new discovery information comes in after the target pod is admitted into the mesh. This behavior was introduced in #13557 to allow multi-node systems like Cassandra to be able to bootstrap their meshed nodes. The solution here might be to make that behavior opt-in for selected workloads only.
Thanks for the update and the link to #13557. The timing of this change also lines up with the first time we've seen the issue with sporadic 403's popping up in the cluster.
Making the behavior configurable as you suggested would certainly fix the problem; however, pods that need this option enabled could then run into the same problem again. The solution we propose is to update the behavior to the following:
- Wait for the pod to be running
- If there is a linkerd-proxy container or init container, wait for that container to be ready
This way, the race condition on pod startup is fixed without interfering with workloads like Cassandra that need to bootstrap. Also, by waiting for the proxy to be ready, we can ensure that no unencrypted connections can be initiated during startup.
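To make the startup window this proposal targets visible, the pod phase can be watched alongside the proxy's readiness; a rough sketch (namespace and label are illustrative, and with native sidecars the proxy is reported under initContainerStatuses - run it in a loop to watch the transition):
for pod in $(kubectl get pods -n demotenant-linkerd3 -l app=server -o name); do
  phase=$(kubectl get "$pod" -n demotenant-linkerd3 -o jsonpath='{.status.phase}')
  # with native sidecars, linkerd-proxy runs as a restartable init container
  proxy_ready=$(kubectl get "$pod" -n demotenant-linkerd3 \
    -o jsonpath='{.status.initContainerStatuses[?(@.name=="linkerd-proxy")].ready}')
  echo "$pod phase=$phase proxyReady=$proxy_ready"
done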
We have implemented a possible fix in one of our clusters and are running the tests against this version. If the tests are successful, we will raise a PR.
Interesting! Looking forward to your test results 👍
So... unfortunately we have some updates that are a bit less exciting.
We have tested the approach of checking the proxy readiness and while it did not break anything, it also did not fix our problem. To check that we were looking in the right place, we reverted the changes to workload_watcher.go made in #13557 - which also did not fix the problem.
This led us to the next troubleshooting step, which was to roll back to the Linkerd version we were using before our 02/2025 upgrade cycle, specifically edge-24.8.2 - and to our surprise, the behavior stayed the same.
Our conclusion for now is that the issue might have existed prior to the upgrade and simply went unnoticed before, and also that the proposed change does not fix things.
Do you maybe have any other ideas? We're at a loss to be honest.
Thanks again for all this great feedback @bkittinger. It turns out there's a bug with the native sidecar model causing the discovery cache in the Destination controller to become stale. I'll be pushing a fix soon, but in the meantime can you try changing your Linkerd setting to proxy.nativeSidecar: false and then bouncing the target pods? If my theory is correct, that should fix your issue, at least temporarily.
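For anyone following along, a sketch of how that setting could be flipped on a Helm-managed control plane; the repo alias and release name are assumptions, and the affected target pods need to be bounced afterwards:
# turn off native sidecars for injected proxies
helm upgrade linkerd-control-plane linkerd-edge/linkerd-control-plane \
  -n linkerd --reuse-values \
  --set proxy.nativeSidecar=false

# bounce the target workloads so they get re-injected (namespace is illustrative)
kubectl -n demotenant-linkerd3 rollout restart deployment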
Now that was the thing we were looking for - looks like you solved the puzzle. In fact, I was struggling to reproduce the issue on my local machine in a kind cluster for the better part of the week, trying to figure out what was different about my local setup that I kept as vanilla as possible.
Well - as soon as I switch the proxies to the native sidecar mode, the problem appears instantly, with the same ~30% failure rate we've been seeing everywhere else. Turning native sidecars off immediately gets things back to normal.
As I'm on vacation for the next two weeks, I'm not going to be able to test any possible fixes but I'm sure @micke-post and @soma-kurisu will be happy to provide any information or help needed.
Thanks again for all the help and insights, @alpeb - this issue has been driving us and our devs crazy for months...
I also tried it with my reproduction helm chart, and after setting the config.alpha.linkerd.io/proxy-enable-native-sidecar: "false" annotation on the namespace, prometheus was able to scrape all endpoints without further issues. Thanks for the help so far!