Healthchecks/livenessProbe using gRPC in `all-authenticated` environment with `Server`
What is the issue?
I closely followed #7050 and was happy to see it was solved in 2.12.0, so we took the time this week to upgrade our Linkerd installation on our dev cluster (AKS using Azure CNI, if that matters). Indeed, HTTP/1 health checks worked flawlessly out of the box. Unfortunately, I couldn't get HTTP/2 gRPC health checks working.
How can it be reproduced?
I have a basic .NET gRPC service (created as documented here) extended with a basic health check service (the GrpcGreeter app extended as documented here). Locally I can call the health check service without problems.
I then deploy the Docker image of this build into a namespace which has the following annotations:
config.linkerd.io/default-inbound-policy: deny
linkerd.io/inject: enabled
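For completeness, the namespace manifest looks roughly like this (the name grpc-demo is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: grpc-demo   # placeholder name
  annotations:
    config.linkerd.io/default-inbound-policy: deny
    linkerd.io/inject: enabled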
The Deployment resource has the following spec.template.spec.containers[0].livenessProbe configured:
livenessProbe:
  grpc:
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
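For context, here is a sketch of how that probe sits in the full Deployment (the image and most other fields are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-greeter
spec:
  selector:
    matchLabels:
      app: grpc-greeter
  template:
    metadata:
      labels:
        app: grpc-greeter
    spec:
      containers:
        - name: grpc-greeter
          image: example.azurecr.io/grpc-greeter:latest   # placeholder image
          ports:
            - containerPort: 80
          livenessProbe:
            grpc:
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3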
Then I added the following Service:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: grpc-greeter
  name: grpc-greeter
spec:
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: grpc-greeter
And defined the following Server resource for it:
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: grpc-greeter
  labels:
    app: grpc-greeter
spec:
  podSelector:
    matchLabels:
      app: grpc-greeter
  port: 80
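As an aside, Server also accepts an explicit protocol hint via spec.proxyProtocol; a variant with the gRPC hint set (just a sketch; I have not verified whether it changes the probe behavior):

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: grpc-greeter
  labels:
    app: grpc-greeter
spec:
  podSelector:
    matchLabels:
      app: grpc-greeter
  port: 80
  proxyProtocol: gRPC   # protocol hint; the default is unknown (auto-detect)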
Logs, error output, etc
I can see in the proxy's log that the linkerd proxy is blocking the livenessProbe requests:
[ 148.375774s] INFO ThreadId(01) inbound:accept{client.addr=172.16.36.207:39750}:server{port=80}:http{v=h2}:http{client.addr=172.16.36.207:39750 client.id="-" timestamp=2022-10-11T18:28:14.042295539Z method="POST" uri=http://172.16.36.231:80/grpc.health.v1.Health/Check version=HTTP/2.0 trace_id="" request_bytes="" user_agent="kube-probe/1.24 grpc-go/1.40.0" host=""}:rescue{client.addr=172.16.36.207:39750}: linkerd_app_core::errors::respond: Request failed error=unauthorized request on route
[ 148.375780s] DEBUG ThreadId(01) inbound:accept{client.addr=172.16.36.207:39750}:server{port=80}:http{v=h2}:http{client.addr=172.16.36.207:39750 client.id="-" timestamp=2022-10-11T18:28:14.042295539Z method="POST" uri=http://172.16.36.231:80/grpc.health.v1.Health/Check version=HTTP/2.0 trace_id="" request_bytes="" user_agent="kube-probe/1.24 grpc-go/1.40.0" host=""}: linkerd_app_core::errors::respond: Handling error on gRPC connection code=The caller does not have permission to execute the specified operation
output of linkerd check -o short
Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.11.1 but the latest stable version is 2.12.1
    see https://linkerd.io/2.11/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * metrics-api-595c7b564-7ls6t (stable-2.11.4)
        * prometheus-77b9558b4b-4nqjm (stable-2.11.4)
        * tap-7f8f67546f-x624j (stable-2.11.4)
        * tap-injector-6b6c5c86d4-cqsv5 (stable-2.11.4)
        * web-6756f5956c-z4kdl (stable-2.11.4)
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-db56d7cb4-qm44p running but cli running stable-2.11.1
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √
Environment
- Kubernetes version: 1.24.3
- Cluster environment: AKS
- Host OS: Linux (Ubuntu 18.04)
- Linkerd version: 2.12.1
Possible solution
No response
Additional context
Important to note is that removing the Server resource allows the livenessProbe checks to go through.
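For what it's worth, my understanding of the 2.12 policy docs is that probe traffic can also be authorized explicitly with an AuthorizationPolicy plus a NetworkAuthentication for the kubelet; a sketch of what that could look like here (the kubelet name and CIDR are placeholders for my cluster, and I have not verified that this fixes the gRPC case):

apiVersion: policy.linkerd.io/v1alpha1
kind: NetworkAuthentication
metadata:
  name: kubelet
spec:
  networks:
    - cidr: 10.0.0.0/8   # placeholder; must cover the actual kubelet/node network
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: grpc-greeter-probes
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: grpc-greeter
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: kubelet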
I also wanted to mention that my configuration for an HTTP livenessProbe is exactly the same as in the aforementioned 'How can it be reproduced?' description, except that the service is an ASP.NET Web API with health checks enabled on the /healthz route, and the deployment's livenessProbe is configured like so:
livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Would you like to work on fixing this bug?
No response