Allow health probe paths without leading slash
What problem are you trying to solve?
Overview
I recently ran into a very odd issue that took me quite a while to figure out, and I wanted to mention it here to help improve Linkerd and prevent it from happening to others in the future.
The issue has to do with pods that have multiple containers, specifically when running healthchecks/probes in those containers.
The issue
I was working with a very specific use case, but one that is easily reproducible since it uses a relatively large product, JFrog Artifactory/Xray.
Within the Xray pods there is a container, observability, that calls another container, router, for its healthcheck (for some reason). Because the probe targets another container, I believe this strays from the behavior Linkerd expects and excludes it from the allowed probe policies. As a result, these healthchecks all fail and it isn't clear why (it took me quite a while to figure out this was going on, and it doesn't help that this was my first time using Linkerd 🥲).
This can be seen by looking at the probe routes defined in the proxy logs; note that only the router container's paths appear:
protocol: Detect {
    http: [
        Route {
            hosts: [],
            rules: [
                Rule {
                    matches: [
                        MatchRequest {
                            path: Some(Exact("/router/api/v1/system/liveness")),
                            headers: [],
                            query_params: [],
                            method: Some(GET)
                        },
                        MatchRequest {
                            path: Some(Exact("/router/api/v1/system/readiness")),
                            headers: [],
                            query_params: [],
                            method: Some(GET)
                        }
                    ],
                    policy: RoutePolicy {
                        meta: Default { name: "probe" },
                        authorizations: [
                            Authorization {
                                networks: [
                                    Network { net: 0.0.0.0/0, except: [] },
                                    Network { net: ::/0, except: [] }
                                ],
                                authentication: Unauthenticated,
                                meta: Default { name: "probe" }
                            },
                            ....
My guess is that the healthcheck routes get defined based on which container ports are declared. The router container declares a container port and its probes use these two routes, which is what leads me to this assumption. The observability container does not declare a container port, but its probes do use the same port (8082) declared by the router container.
Below is a snippet of the k8s manifest, which can also be viewed at https://github.com/jfrog/charts/blob/master/stable/xray/templates/xray-statefulset.yaml
containers:
  - command:
      - /bin/sh
      - '-c'
      - |
        exec /opt/jfrog/router/app/bin/entrypoint-router.sh;
    env:
      - name: JF_ROUTER_TOPOLOGY_LOCAL_REQUIREDSERVICETYPES
        value: jfxr,jfxana,jfxidx,jfxpst,jfob
    image: releases-docker.jfrog.io/jfrog/router:7.135.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /router/api/v1/system/liveness
        port: 8082
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: router
    ports:
      - containerPort: 8082
        name: http-router
        protocol: TCP
    readinessProbe:
      failureThreshold: 5
      httpGet:
        path: /router/api/v1/system/readiness
        port: 8082
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      limits:
        cpu: '1'
        memory: 1Gi
      requests:
        cpu: 20m
        memory: 50Mi
    securityContext:
      allowPrivilegeEscalation: false
      appArmorProfile:
        type: RuntimeDefault
      capabilities:
        drop:
          - NET_RAW
      runAsNonRoot: true
    startupProbe:
      failureThreshold: 30
      httpGet:
        path: /router/api/v1/system/readiness
        port: 8082
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 5
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
      - mountPath: /var/opt/jfrog/router
        name: data-volume
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-rcchr
        readOnly: true
  - command:
      - /bin/sh
      - '-c'
      - |
        exec /opt/jfrog/observability/app/bin/entrypoint-observability.sh;
    image: releases-docker.jfrog.io/jfrog/observability:1.31.5
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: observability/api/v1/system/liveness
        port: 8082
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: observability
    resources:
      limits:
        cpu: '1'
        memory: 250Mi
      requests:
        cpu: 10m
        memory: 25Mi
    securityContext:
      allowPrivilegeEscalation: false
      appArmorProfile:
        type: RuntimeDefault
      capabilities:
        drop:
          - NET_RAW
      runAsNonRoot: true
    startupProbe:
      failureThreshold: 90
      httpGet:
        path: observability/api/v1/system/readiness
        port: 8082
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 5
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
      - mountPath: /var/opt/jfrog/observability
        name: data-volume
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-rcchr
        readOnly: true
How should the problem be solved?
Proposed solution
By default, if a container port is defined and used in any healthcheck, that port and its probe routes should be added to the probe policies for all other containers in the pod. There may be cases where somebody does not want containers in a pod to communicate with each other, but I think that would be a very unusual exception.
Any alternatives you've considered?
My solution
In order to fix this for my deployment, I created a specific HTTPRoute and an attached authorization policy. I actually had to create two routes: one for the healthcheck paths, and one catch-all for all other routes. The catch-all policy was needed to let my MeshTLSAuthentication keep access to everything else, since attaching any HTTPRoute to a Server takes precedence over pre-existing "Server-level" auth policies. A sketch of this workaround is shown below.
The workaround did the job, but it was very difficult to discover and debug, since healthchecks seemed to work in most other cases: the typical "it works here, why not over there? It looks the same".
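For reference, here is a rough sketch of the kind of resources I created. The resource names, the namespace, and the Server name (xray-http) are illustrative placeholders, not my exact manifests:

# Route matching only the observability probe paths on the Server
# that fronts port 8082 (assumed here to be named "xray-http").
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: observability-probes
  namespace: xray
spec:
  parentRefs:
    - group: policy.linkerd.io
      kind: Server
      name: xray-http
  rules:
    - matches:
        - path:
            type: Exact
            value: /observability/api/v1/system/liveness
        - path:
            type: Exact
            value: /observability/api/v1/system/readiness
---
# Authorize the probe route for unauthenticated clients (e.g. the kubelet).
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: observability-probes-policy
  namespace: xray
spec:
  targetRef:
    group: policy.linkerd.io
    kind: HTTPRoute
    name: observability-probes
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: all-networks
---
# Assumed NetworkAuthentication allowing any client network (covers the kubelet).
apiVersion: policy.linkerd.io/v1alpha1
kind: NetworkAuthentication
metadata:
  name: all-networks
  namespace: xray
spec:
  networks:
    - cidr: 0.0.0.0/0
    - cidr: ::/0

A second, catch-all HTTPRoute on the same Server, with an AuthorizationPolicy pointing at my MeshTLSAuthentication, restored access to the remaining routes for meshed clients.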
How would users interact with this feature?
No response
Would you like to work on this feature?
no
hi @efrenaguilar95
I believe what is happening here is that the paths for the observability container's probes are missing the leading slash character. This causes them to fail to be parsed correctly as paths. I believe that if you update these probe paths to include the leading slash, Linkerd should authorize all probe paths, even if they are in different containers.
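For example, the observability container's liveness probe from the manifest above would become:

livenessProbe:
  failureThreshold: 5
  httpGet:
    path: /observability/api/v1/system/liveness   # leading slash added
    port: 8082
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5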
hey @adleong
Wow, great insight. Yes, I just tried this and can confirm that it fixed the problem.
I'm not sure whether this should be handled as an update to the probe path parser, or whether I should submit a PR to JFrog to update their Helm chart.
Either way, I hopefully now have a path towards removing a lot of the resources I had to create in order to get a working auth flow.
I'll leave this open for the linkerd folks to decide what to do with this now. Thanks for your help!
@efrenaguilar95 glad to hear it! I've re-titled this issue to "Allow health probe paths without leading slash" and we can use this to track the work of relaxing Linkerd's restrictions to allow paths which don't have the leading slash.
@adleong can I take a stab at trying to fix this? It looks like a good first issue.