
Allow health probe paths without leading slash

Open efrenaguilar95 opened this issue 8 months ago • 5 comments

What problem are you trying to solve?

Overview

I recently ran into a very odd issue that took me quite a while to figure out, and I wanted to mention it here to help improve Linkerd and prevent it from happening to others in the future.

The issue has to do with pods that have multiple containers, specifically when running healthchecks/probes in those containers.

The issue

I was working with a very specific use case, but one that is easily reproducible since it uses a relatively large product, JFrog Artifactory/Xray.

Within the Xray pods there is a container, observability, that (for some reason) calls another container, router, for its healthcheck. Because the probe runs against another container, I believe this strays from the behavior Linkerd expects and removes it from the allowed policies. As a result, these healthchecks all fail and it isn't clear why. It took me quite a while to figure out this was what was going on, and it doesn't help that this was my first time using Linkerd 🥲

This can be seen by looking at the protocols defined in the proxy logs:

protocol: Detect {
  http: [
    Route {
      hosts: [],
      rules: [
        Rule {
          matches: [
            MatchRequest {
              path: Some(Exact("/router/api/v1/system/liveness")),
              headers: [],
              query_params: [],
              method: Some(GET)
            },
            MatchRequest {
              path: Some(Exact("/router/api/v1/system/readiness")),
              headers: [],
              query_params: [],
              method: Some(GET)
            }
          ],
          policy: RoutePolicy {
            meta: Default { name: "probe" },
            authorizations: [
              Authorization {
                networks: [
                  Network { net: 0.0.0.0/0, except: [] },
                  Network { net: ::/0, except: [] }
                ],
                authentication: Unauthenticated,
                meta: Default { name: "probe" }
              },
....

My guess is that the healthcheck routes get defined based on which container port is declared. The router container declares a container port and uses these two routes, which is what leads me to this assumption.

The observability container does not define a container port, but its probes do use the same port declared by the router container.

Below is a snippet of the rendered k8s manifest; the source template can be viewed at https://github.com/jfrog/charts/blob/master/stable/xray/templates/xray-statefulset.yaml

containers:
    - command:
        - /bin/sh
        - '-c'
        - |
          exec /opt/jfrog/router/app/bin/entrypoint-router.sh;
      env:
        - name: JF_ROUTER_TOPOLOGY_LOCAL_REQUIREDSERVICETYPES
          value: jfxr,jfxana,jfxidx,jfxpst,jfob
      image: releases-docker.jfrog.io/jfrog/router:7.135.2
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 5
        httpGet:
          path: /router/api/v1/system/liveness
          port: 8082
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      name: router
      ports:
        - containerPort: 8082
          name: http-router
          protocol: TCP
      readinessProbe:
        failureThreshold: 5
        httpGet:
          path: /router/api/v1/system/readiness
          port: 8082
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 20m
          memory: 50Mi
      securityContext:
        allowPrivilegeEscalation: false
        appArmorProfile:
          type: RuntimeDefault
        capabilities:
          drop:
            - NET_RAW
        runAsNonRoot: true
      startupProbe:
        failureThreshold: 30
        httpGet:
          path: /router/api/v1/system/readiness
          port: 8082
          scheme: HTTP
        initialDelaySeconds: 30
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 5
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/opt/jfrog/router
          name: data-volume
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-rcchr
          readOnly: true
    - command:
        - /bin/sh
        - '-c'
        - |
          exec /opt/jfrog/observability/app/bin/entrypoint-observability.sh;
      image: releases-docker.jfrog.io/jfrog/observability:1.31.5
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 5
        httpGet:
          path: observability/api/v1/system/liveness
          port: 8082
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      name: observability
      resources:
        limits:
          cpu: '1'
          memory: 250Mi
        requests:
          cpu: 10m
          memory: 25Mi
      securityContext:
        allowPrivilegeEscalation: false
        appArmorProfile:
          type: RuntimeDefault
        capabilities:
          drop:
            - NET_RAW
        runAsNonRoot: true
      startupProbe:
        failureThreshold: 90
        httpGet:
          path: observability/api/v1/system/readiness
          port: 8082
          scheme: HTTP
        initialDelaySeconds: 30
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 5
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/opt/jfrog/observability
          name: data-volume
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-rcchr
          readOnly: true
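
Condensed from the manifest above, the relevant difference between the two containers is:

# router: declares the container port used by its probes
ports:
  - containerPort: 8082
livenessProbe:
  httpGet:
    path: /router/api/v1/system/liveness
    port: 8082

# observability: declares no ports, but its probes use the same port
livenessProbe:
  httpGet:
    path: observability/api/v1/system/liveness
    port: 8082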

How should the problem be solved?

Proposed solution

By default, if a container port is defined and used in any healthcheck, that port and its healthchecks should be added to the probe policies for all other containers in the pod. There may be cases where somebody does not want the containers in a pod to communicate with each other, but I think that would be a very odd exception.

Any alternatives you've considered?

My solution

To fix this in my deployment, I created a specific HTTPRoute and a connected authorization policy. I actually had to create two pairs: one for the healthcheck routes, and one for all other routes. I had to create the catch-all policy to let my MeshTLSAuthentication reach everything else, since attaching any HTTPRoute to a Server takes precedence over pre-existing "Server-level" auth policies. A sketch of the probe-route pair is below.
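
This is roughly the shape the probe-route pair takes. Note that this is an illustrative sketch, not my exact manifests: the names, the xray namespace, and the assumption of a pre-existing Server named xray-server covering port 8082 are all hypothetical.

apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: observability-probes
  namespace: xray
spec:
  parentRefs:
    - group: policy.linkerd.io
      kind: Server
      name: xray-server        # assumed Server selecting port 8082
  rules:
    - matches:
        - path:
            type: Exact
            value: /observability/api/v1/system/liveness
          method: GET
        - path:
            type: Exact
            value: /observability/api/v1/system/readiness
          method: GET
---
# Allow unauthenticated probe traffic (e.g. from the kubelet) on this route.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: observability-probes-authz
  namespace: xray
spec:
  targetRef:
    group: policy.linkerd.io
    kind: HTTPRoute
    name: observability-probes
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: all-networks
---
apiVersion: policy.linkerd.io/v1alpha1
kind: NetworkAuthentication
metadata:
  name: all-networks
  namespace: xray
spec:
  networks:
    - cidr: 0.0.0.0/0
    - cidr: ::/0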

The solution worked, but it was very difficult to discover and debug, since healthchecks seemed to work in most other cases: the typical "it works here, so why not over there? It looks the same."

How would users interact with this feature?

No response

Would you like to work on this feature?

no

efrenaguilar95 avatar May 16 '25 21:05 efrenaguilar95

hi @efrenaguilar95

I believe what is happening here is that the paths for the observability container's probes are missing the leading slash character. This causes them to fail to be parsed correctly as paths. I believe that if you update these probe paths to include the leading slash, Linkerd should authorize all probe paths, even if they are in different containers.
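
For example, the observability liveness probe would become the following (with the readiness and startup probes changed the same way):

livenessProbe:
  httpGet:
    path: /observability/api/v1/system/liveness   # note the leading slash
    port: 8082
    scheme: HTTP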

adleong avatar May 21 '25 22:05 adleong

hey @adleong

Wow, great insight. Yes, I did just try this and can confirm that this did fix the problem.

Not sure if this would be something to add as an update to the probe path parser, or if I should submit a PR to JFrog to update their Helm chart.

Either way, I hopefully now have a path towards removing a lot of the resources I had to create in order to get a working auth flow.

I'll leave this open for the linkerd folks to decide what to do with this now. Thanks for your help!

efrenaguilar95 avatar May 21 '25 23:05 efrenaguilar95

@efrenaguilar95 glad to hear it! I've re-titled this issue to "Allow health probe paths without leading slash" and we can use this to track the work of relaxing Linkerd's restrictions to allow paths which don't have the leading slash.

adleong avatar May 22 '25 17:05 adleong

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 20 '25 18:08 stale[bot]

@adleong can I take a stab at trying to fix this? It looks like a good first issue.

king-11 avatar Oct 17 '25 03:10 king-11