Healthchecks/livenessProbe using gRPC in `all-authenticated` environment with `Server`

Open AlexGoris-KasparSolutions opened this issue 2 years ago • 10 comments

What is the issue?

I closely followed #7050 and was happy to see it was solved in 2.12.0. This week we took the time to upgrade the Linkerd installation on our dev cluster (AKS using Azure CNI, if that matters). HTTP/1 health checks indeed worked flawlessly out of the box. Unfortunately, I couldn't get HTTP/2 gRPC health checks working.

How can it be reproduced?

I have a basic .NET gRPC service (created as documented here) extended with a basic health check service (the GrpcGreeter app, extended as documented here). Locally I can call the health check service without problems.

I then deploy the Docker image of this build into a namespace which has the following annotations:

config.linkerd.io/default-inbound-policy: deny
linkerd.io/inject: enabled

The Deployment resource has the following spec.containers[0].livenessProbe configured:

livenessProbe:
  grpc:
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

I then added the following Service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: grpc-greeter
  name: grpc-greeter
spec:
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: grpc-greeter

And defined the following Server resource for it:

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: grpc-greeter
  labels:
    app: grpc-greeter
spec:
  podSelector:
    matchLabels:
      app: grpc-greeter
  port: 80

Logs, error output, etc

I can see the linkerd proxy is blocking the livenessProbe connections in the proxy's log:

[   148.375774s]  INFO ThreadId(01) inbound:accept{client.addr=172.16.36.207:39750}:server{port=80}:http{v=h2}:http{client.addr=172.16.36.207:39750 client.id="-" timestamp=2022-10-11T18:28:14.042295539Z method="POST" uri=http://172.16.36.231:80/grpc.health.v1.Health/Check version=HTTP/2.0 trace_id="" request_bytes="" user_agent="kube-probe/1.24 grpc-go/1.40.0" host=""}:rescue{client.addr=172.16.36.207:39750}: linkerd_app_core::errors::respond: Request failed error=unauthorized request on route
[   148.375780s] DEBUG ThreadId(01) inbound:accept{client.addr=172.16.36.207:39750}:server{port=80}:http{v=h2}:http{client.addr=172.16.36.207:39750 client.id="-" timestamp=2022-10-11T18:28:14.042295539Z method="POST" uri=http://172.16.36.231:80/grpc.health.v1.Health/Check version=HTTP/2.0 trace_id="" request_bytes="" user_agent="kube-probe/1.24 grpc-go/1.40.0" host=""}: linkerd_app_core::errors::respond: Handling error on gRPC connection code=The caller does not have permission to execute the specified operation
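
To make the relevant fields of the INFO line easier to read, here is a minimal Python sketch that pulls out the key=value pairs. The `parse_fields` helper and the abbreviated `LOG_LINE` are illustrative only and not part of any Linkerd tooling. It assumes that `client.id="-"` means the request carried no mesh identity, which is my reading of the log rather than something confirmed here.

```python
import re

# Abbreviated copy of the INFO line above (empty fields and the timestamp are omitted).
LOG_LINE = (
    '[   148.375774s]  INFO ThreadId(01) '
    'inbound:accept{client.addr=172.16.36.207:39750}:server{port=80}:http{v=h2}:'
    'http{client.addr=172.16.36.207:39750 client.id="-" method="POST" '
    'uri=http://172.16.36.231:80/grpc.health.v1.Health/Check version=HTTP/2.0 '
    'user_agent="kube-probe/1.24 grpc-go/1.40.0" host=""}:'
    'rescue{client.addr=172.16.36.207:39750}: '
    'linkerd_app_core::errors::respond: Request failed '
    'error=unauthorized request on route'
)

# key=value pairs where the value is either double-quoted or a bare token.
FIELD_RE = re.compile(r'([\w.]+)=(?:"([^"]*)"|([^\s{}"]+))')


def parse_fields(line):
    """Hypothetical helper: return the key=value fields of one proxy log line."""
    # The error message is free text running to the end of the line,
    # so split it off before matching the structured fields.
    head, sep, error = line.partition(" error=")
    fields = {}
    for key, quoted, bare in FIELD_RE.findall(head):
        # Keep the first occurrence of a key (client.addr repeats across spans).
        fields.setdefault(key, quoted if quoted else bare)
    if sep:
        fields["error"] = error.strip()
    return fields


fields = parse_fields(LOG_LINE)
# Assumption: "-" stands for "no client identity presented".
print("caller:         ", fields["user_agent"])
print("target:         ", fields["uri"])
print("unauthenticated:", fields["client.id"] == "-")
print("error:          ", fields["error"])
```

Read this way, the rejected request is the kubelet's own gRPC probe (`kube-probe`) calling `/grpc.health.v1.Health/Check` on port 80 without a client identity.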

Output of `linkerd check -o short`

Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.11.1 but the latest stable version is 2.12.1
    see https://linkerd.io/2.11/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * metrics-api-595c7b564-7ls6t (stable-2.11.4)
        * prometheus-77b9558b4b-4nqjm (stable-2.11.4)
        * tap-7f8f67546f-x624j (stable-2.11.4)
        * tap-injector-6b6c5c86d4-cqsv5 (stable-2.11.4)
        * web-6756f5956c-z4kdl (stable-2.11.4)
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-db56d7cb4-qm44p running  but cli running stable-2.11.1
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

  • Kubernetes version: 1.24.3
  • Cluster environment: AKS
  • Host OS: Linux (Ubuntu 18.04)
  • Linkerd version: 2.12.1

Possible solution

No response

Additional context

Note that removing the Server resource allows the livenessProbe checks to go through.

For comparison, my configuration for the working HTTP livenessProbe is exactly the same as described under "How can it be reproduced?", except that the service is an ASP.NET Web API with health checks enabled on the /healthz route, and the Deployment's livenessProbe is configured like so:

livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Would you like to work on fixing this bug?

No response