linkerd-destination policy fails on initial pod list, if it contains an object with invalid values

Open baracoder opened this issue 2 years ago • 2 comments

What is the issue?

If a single pod object contains an invalid value, the policy container of the linkerd-destination pods fails to become ready because it cannot parse the initial pod list. Without this deployment being available, all meshed pods stop working properly.

How can it be reproduced?

  1. Create a pod with invalid value null in spec.volumes[].projected.sources
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
    name: fail-parsing
    namespace: default
    annotations:
      linkerd.io/inject: enabled
spec:
    volumes:
        - name: exporter-config
          projected:
              sources: null
              defaultMode: 420
    containers:
        - name: exporter
          image: "busybox"
          args:
              - sleep
              - 1h
          volumeMounts:
              - name: exporter-config
                mountPath: /conf
EOF

Note: The Kubernetes API server accepts this object.

  2. Restart the linkerd-destination deployment:
kubectl rollout restart deployment -n linkerd linkerd-destination

Logs, error output, etc

{"timestamp":"2022-06-14T10:39:49.418693Z","level":"WARN","fields":{"message":"{\"kind\":\"PodList\",\"apiVersion\":\"v1\" ... \"qosClass\":\"Burstable\"}}]}\n, Error(\"invalid type: null, expected a sequence\", line: 1, column: 2223366)"},"target":"kube::client","spans":[{"name":"pods"}]}
{"timestamp":"2022-06-14T10:39:49.460721Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: Error deserializing response"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}

In plain text:

Error("invalid type: null, expected a sequence", line: 1, column: 2223366)
failed to perform initial object list: Error deserializing response

output of linkerd check -o short

❯ linkerd check -o short
Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.11.1 but the latest stable version is 2.11.2
    see https://linkerd.io/2.11/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.11.1 but the latest stable version is 2.11.2
    see https://linkerd.io/2.11/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-76f9b7cccb-b7rdr (stable-2.11.1)
	* linkerd-destination-76f9b7cccb-gzvpv (stable-2.11.1)
	* linkerd-destination-76f9b7cccb-hxh8c (stable-2.11.1)
	* linkerd-identity-8448f698-h572b (stable-2.11.1)
	* linkerd-identity-8448f698-rfdxx (stable-2.11.1)
	* linkerd-identity-8448f698-xqlpg (stable-2.11.1)
	* linkerd-proxy-injector-85df7dd89-hfcwm (stable-2.11.1)
	* linkerd-proxy-injector-85df7dd89-tzmnh (stable-2.11.1)
	* linkerd-proxy-injector-85df7dd89-v9w7q (stable-2.11.1)
    see https://linkerd.io/2.11/checks/#l5d-cp-proxy-version for hints

Status check results are √

Linkerd extensions checks
=========================

linkerd-jaeger
--------------
‼ collector and jaeger service account exists
    missing ServiceAccounts: jaeger
    see https://linkerd.io/2.11/checks/#l5d-jaeger-sc-exists for hints

Status check results are √

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* grafana-5487ffc69d-jqjfj (stable-2.11.1)
	* metrics-api-65799f4f58-9hv66 (stable-2.11.1)
	* tap-54ddb4d68b-cf7pg (stable-2.11.1)
	* tap-54ddb4d68b-q8pm4 (stable-2.11.1)
	* tap-54ddb4d68b-zwc6p (stable-2.11.1)
	* tap-injector-5887f7db94-8f2s7 (stable-2.11.1)
	* web-75d7f664b-2jhj5 (stable-2.11.1)
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cp-version for hints
‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-viz-prometheus
    see https://linkerd.io/2.11/checks/#l5d-viz-prometheus for hints

Status check results are √

Environment

  • Kubernetes version: v1.21.11-gke.900
  • Environment: GKE
  • Host OS: Linux
  • Linkerd version: stable-2.11.1

Possible solution

To prevent a problem with one pod from breaking the whole mesh, Linkerd could skip pods with invalid values, while logging an error.
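
A minimal sketch of this idea, written in Python for brevity. It is not Linkerd code (the policy controller is written in Rust), and it assumes the caller has access to the raw PodList JSON and can decode it item by item. The names `parse_pod` and `parse_pod_list` are hypothetical.

```python
import json
import logging

logger = logging.getLogger("podlist")


def parse_pod(item):
    """Strictly validate one pod; raise if projected.sources is not a list."""
    name = item["metadata"]["name"]
    for volume in (item.get("spec") or {}).get("volumes") or []:
        projected = volume.get("projected")
        if projected is not None and not isinstance(projected.get("sources"), list):
            raise ValueError(f"pod {name}: projected.sources must be a sequence")
    return name


def parse_pod_list(raw):
    """Decode a PodList one item at a time, skipping items that fail validation."""
    accepted = []
    for item in json.loads(raw).get("items", []):
        try:
            accepted.append(parse_pod(item))
        except (KeyError, TypeError, ValueError) as err:
            # One bad object is logged and dropped instead of failing the whole list.
            logger.error("skipping invalid pod: %s", err)
    return accepted


# Illustrative input: one valid pod and one shaped like the reproduction above.
good = {"metadata": {"name": "ok"},
        "spec": {"volumes": [{"name": "v", "projected": {"sources": []}}]}}
bad = {"metadata": {"name": "fail-parsing"},
       "spec": {"volumes": [{"name": "exporter-config",
                             "projected": {"sources": None, "defaultMode": 420}}]}}
print(parse_pod_list(json.dumps({"kind": "PodList", "items": [good, bad]})))  # prints ['ok']
```

The point of the sketch is the placement of the `try`/`except`: validation happens per item, so a single invalid pod costs one log line rather than the readiness of the whole controller.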

Additional context

The API reference for Kubernetes 1.21 does not mention null as a valid value for sources in ProjectedVolumeSource, but the pod object is still created.
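
Because the API server accepts such objects, it can help to locate them before restarting the control plane. The following is a rough sketch in Python that scans a pod list, in the JSON shape returned by `kubectl get pods --all-namespaces -o json`, for projected volumes whose `sources` is explicitly null. The function name `pods_with_null_sources` is made up for illustration.

```python
import json


def pods_with_null_sources(pod_list):
    """Yield (namespace, pod name, volume name) for projected volumes with sources: null."""
    for item in pod_list.get("items", []):
        meta = item.get("metadata") or {}
        for volume in (item.get("spec") or {}).get("volumes") or []:
            projected = volume.get("projected")
            # Only flag an explicit null, not a missing key.
            if (isinstance(projected, dict)
                    and "sources" in projected
                    and projected["sources"] is None):
                yield meta.get("namespace"), meta.get("name"), volume.get("name")


# Illustrative input mirroring the pod from the reproduction steps.
sample = json.loads("""
{"items": [{"metadata": {"name": "fail-parsing", "namespace": "default"},
            "spec": {"volumes": [{"name": "exporter-config",
                                  "projected": {"sources": null, "defaultMode": 420}}]}}]}
""")
for namespace, name, volume in pods_with_null_sources(sample):
    print(f"{namespace}/{name}: volume {volume}")
```

To run it against a cluster, pipe the kubectl output into a script that passes `json.load(sys.stdin)` to the function instead of the inline sample.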

Would you like to work on fixing this bug?

No response

baracoder Jun 14 '22 14:06

This is most likely a problem that will have to be solved in https://github.com/Arnavion/k8s-openapi and, ultimately, in the Kubernetes API spec. Since the API spec does not describe the field as optional, deserializers that are derived from the API spec expect the field to be required.

This is similar to another issue we encountered https://github.com/kubernetes/kubernetes/issues/100802

To prevent a problem with one pod from breaking the whole mesh, Linkerd could skip pods with invalid values, while logging an error.

I'm not sure that we can realistically work around this in Linkerd--we don't actually handle decoding individual pod responses. Rather, the API clients throw an error about the whole API response. The best we could do is to update k8s-openapi to treat the field as optional.
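
To illustrate the difference between a required and an optional field, here is a language-neutral sketch in Python. It does not use or reproduce the k8s-openapi API (which is Rust); `decode_strict` and `decode_lenient` are hypothetical helpers that only mimic the two behaviours.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProjectedVolumeSource:
    sources: List[dict] = field(default_factory=list)
    default_mode: Optional[int] = None


def decode_strict(obj):
    """Mimic a decoder generated from a spec that treats `sources` as a required sequence."""
    sources = obj.get("sources")
    if not isinstance(sources, list):
        raise TypeError(f"invalid type: {type(sources).__name__}, expected a sequence")
    return ProjectedVolumeSource(sources, obj.get("defaultMode"))


def decode_lenient(obj):
    """Treat `sources` as optional: a missing or null value becomes an empty list."""
    return ProjectedVolumeSource(obj.get("sources") or [], obj.get("defaultMode"))


volume = json.loads('{"sources": null, "defaultMode": 420}')
try:
    decode_strict(volume)
except TypeError as err:
    print("strict decoder failed:", err)
print("lenient decoder returned:", decode_lenient(volume))
```

With the strict variant, one null aborts decoding of everything that contains it, which matches the failure in the logs; with the lenient variant the same object decodes to an empty list of sources.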

olix0r Jun 14 '22 15:06

Potentially related to https://github.com/kubernetes/kubernetes/issues/93903

olix0r Jun 14 '22 15:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] Sep 28 '22 22:09