
Policy controller noisy connection logging

xolott opened this issue 8 months ago · 5 comments

What is the issue?

The policy controller's watches on Kubernetes resources keep failing (I believe for all watched resources). According to Cilium, not a single packet is dropped (I verified this with Hubble), yet the controller reports the connection as reset. Using curl from within the container, I can make the same GET request to the API server and get a response, so the CNI is not dropping this connection:

$ curl -v --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" "https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531"

* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/2 200 
< audit-id: 568e2098-a3a2-4216-9503-3a8b7cdb1050
< cache-control: no-cache, private
< content-type: application/json
< x-kubernetes-pf-flowschema-uid: f2596b47-c5e3-439f-a6e7-a84503cb1560
< x-kubernetes-pf-prioritylevel-uid: c65e2c9b-924d-4ec0-848a-49025cdd9c73
< content-length: 165
< date: Wed, 19 Jun 2024 22:33:07 GMT
< 
{
  "apiVersion": "policy.linkerd.io/v1beta3",
  "items": [],
  "kind": "HTTPRouteList",
  "metadata": {
    "continue": "",
    "resourceVersion": "226670468"
  }
* Connection #0 to host 10.0.0.1 left intact
}
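
For reference, this is roughly the Hubble query I used to confirm that nothing on the path to the API server is being dropped (exact flags from memory; adjust for your Hubble version):

$ hubble observe --verdict DROPPED --to-ip 10.0.0.1 -f

It stays silent even while the policy controller is logging the resets shown below.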

How can it be reproduced?

Using Terraform:

resource "helm_release" "cilium" {
  name       = "cilium"
  repository = "https://helm.cilium.io/"
  chart      = "cilium"
  version    = var.cilium_version
  namespace  = "kube-system"

  set {
    name  = "aksbyocni.enabled"
    value = true
  }

  set {
    name  = "nodeinit.enabled"
    value = true
  }

  set {
    name  = "hubble.relay.enabled"
    value = true
  }

  set {
    name  = "hubble.ui.enabled"
    value = true
  }
}

resource "kubernetes_namespace" "linkerd_namespace" {
  metadata {
    name = "linkerd"
  }
}
resource "kubernetes_namespace" "linkerd_viz_namespace" {
  metadata {
    name = "linkerd-viz"
    labels = {
      "linkerd.io/extension" = "viz"
    }
  }
}

resource "kubernetes_namespace" "linkerd_cni_namespace" {
  metadata {
    name = "linkerd-cni"
    labels = {
      "config.linkerd.io/admission-webhooks" = "disabled"
      "linkerd.io/cni-resource"              = "true"
    }
  }
}

resource "helm_release" "linkerd_cni" {
  depends_on = [kubernetes_namespace.linkerd_namespace]
  name       = "linkerd-cni"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd2-cni"
  namespace  = "linkerd-cni"
  version    = "2024.3.1"
  wait       = true
}

resource "helm_release" "linkerd_crds" {
  depends_on = [kubernetes_namespace.linkerd_namespace, helm_release.linkerd_cni]
  name       = "linkerd-crds"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-crds"
  namespace  = "linkerd"
  version    = "2024.3.1"
  wait       = true

  set {
    name  = "cniEnabled"
    value = true
  }
}

resource "helm_release" "linkerd" {
  depends_on = [helm_release.linkerd_crds]
  name       = "linkerd-control-plane"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-control-plane"
  namespace  = "linkerd"
  version    = "2024.3.1"
  wait       = true

  values = [
    file("${path.module}/values-ha.yaml")
  ]

  set {
    name  = "controllerReplicas"
    value = var.replicas
  }

  set {
    name  = "cniEnabled"
    value = true
  }

  set {
    name  = "identity.externalCA"
    value = true
  }

  set {
    name  = "identity.issuer.scheme"
    value = "kubernetes.io/tls"
  }

  set {
    name  = "disableHeartBeat"
    value = true
  }

  set {
    name  = "webhookFailurePolicy"
    value = "Fail"
  }

  set {
    name  = "networkValidator.connectAddr"
    value = "0.0.0.0:4140"
  }
}

resource "helm_release" "linker-viz" {
  depends_on = [helm_release.linkerd_crds, module.linkerd_viz_policy]
  name       = "linkerd-viz"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-viz"
  namespace  = "linkerd-viz"
  version    = "2024.6.1"
  wait       = true
}

values-ha.yaml

# This values.yaml file contains the values needed to enable HA mode.
# Usage:
#   helm install -f values-ha.yaml

# -- Create PodDisruptionBudget resources for each control plane workload
enablePodDisruptionBudget: true

controller:
  # -- sets pod disruption budget parameter for all deployments
  podDisruptionBudget:
    # -- Maximum number of pods that can be unavailable during disruption
    maxUnavailable: 1

# -- Specify a deployment strategy for each control plane workload
deploymentStrategy:
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 25%

# -- add PodAntiAffinity to each control plane workload
enablePodAntiAffinity: true

# nodeAffinity:

# proxy configuration
proxy:
  resources:
    cpu:
      request: 100m
    memory:
      limit: 250Mi
      request: 20Mi

# controller configuration
controllerReplicas: 3
controllerResources: &controller_resources
  cpu: &controller_resources_cpu
    limit: ""
    request: 100m
  memory:
    limit: 250Mi
    request: 50Mi
destinationResources: *controller_resources

# identity configuration
identityResources:
  cpu: *controller_resources_cpu
  memory:
    limit: 250Mi
    request: 10Mi

# heartbeat configuration
heartbeatResources: *controller_resources

# proxy injector configuration
proxyInjectorResources: *controller_resources
webhookFailurePolicy: Fail

# service profile validator configuration
spValidatorResources: *controller_resources

# flag for linkerd check
highAvailability: true

Logs, error output, etc

Every few minutes, the policy controller logs:

2024-06-19T22:29:04.143485Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:04.367419Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:04.586154Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:05.138944Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:05.401101Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:06.185944Z  INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:06.443435Z  INFO services: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
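
To dig further I raised the controller's log level to debug. A sketch of how (assuming the repo alias below, and that the control-plane chart exposes policyController.logLevel, which I believe it does):

$ helm upgrade linkerd-control-plane linkerd-edge/linkerd-control-plane \
    -n linkerd --reuse-values \
    --set policyController.logLevel=debug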

With the debug log level in place, I see this:

2024-06-14T23:01:36.098797Z DEBUG Connection{peer=Server}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(37), flags: (0x4: END_HEADERS) }
2024-06-14T23:01:36.098872Z DEBUG Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(37) }
2024-06-14T23:01:36.098892Z DEBUG Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(37), flags: (0x1: END_STREAM) }
2024-06-14T23:01:36.099070Z DEBUG Connection{peer=Server}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(37), flags: (0x5: END_HEADERS | END_STREAM) }
2024-06-14T23:01:36.840186Z DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-bcdf65946-6spv7
2024-06-14T23:01:39.458677Z DEBUG hyper::proto::h1::conn: incoming body decode error: Connection reset by peer (os error 104)
2024-06-14T23:01:39.458754Z DEBUG rustls::common_state: Sending warning alert CloseNotify    
2024-06-14T23:01:39.458781Z DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
2024-06-14T23:01:39.458903Z DEBUG hyper::client::client: client connection error: error shutting down connection: Broken pipe (os error 32)
2024-06-14T23:01:39.458914Z DEBUG servers: kube_runtime::watcher: watcher error: ReadEvents(Custom { kind: Other, error: hyper::Error(Body, hyper::Error(Body, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })) })
2024-06-14T23:01:39.458950Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-14T23:01:39.459034Z DEBUG servers: tower::buffer::worker: service.ready=true processing request
2024-06-14T23:01:39.459097Z DEBUG servers:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta2/servers?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: kube_client::client::builder: requesting
2024-06-14T23:01:39.459151Z DEBUG servers:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta2/servers?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: hyper::client::pool: reuse idle connection for ("https", 10.0.0.1)
2024-06-14T23:01:39.459450Z DEBUG hyper::proto::h1::io: flushed 1953 bytes
2024-06-14T23:01:39.460952Z DEBUG hyper::proto::h1::io: parsed 7 headers
2024-06-14T23:01:39.460968Z DEBUG hyper::proto::h1::conn: incoming body is chunked encoding
2024-06-14T23:01:39.825269Z DEBUG hyper::proto::h1::conn: incoming body decode error: Connection reset by peer (os error 104)
2024-06-14T23:01:39.825333Z DEBUG rustls::common_state: Sending warning alert CloseNotify    
2024-06-14T23:01:39.825365Z DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
2024-06-14T23:01:39.825475Z DEBUG hyper::client::client: client connection error: error shutting down connection: Broken pipe (os error 32)
2024-06-14T23:01:39.827989Z DEBUG httproutes.policy.linkerd.io: kube_runtime::watcher: watcher error: ReadEvents(Custom { kind: Other, error: hyper::Error(Body, hyper::Error(Body, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })) })
2024-06-14T23:01:39.828036Z  INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-14T23:01:39.828120Z DEBUG httproutes.policy.linkerd.io: tower::buffer::worker: service.ready=true processing request
2024-06-14T23:01:39.828184Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: kube_client::client::builder: requesting
2024-06-14T23:01:39.828270Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: hyper::client::connect::http: connecting to 10.0.0.1:443
2024-06-14T23:01:39.829996Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: hyper::client::connect::http: connected to 10.0.0.1:443
2024-06-14T23:01:39.830113Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::hs: Resuming session    
2024-06-14T23:01:39.845683Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::hs: Using ciphersuite TLS13_AES_128_GCM_SHA256    
2024-06-14T23:01:39.845723Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::tls13: Not resuming    
2024-06-14T23:01:39.845844Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::tls13: TLS1.3 encrypted extensions: []    
2024-06-14T23:01:39.845855Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::hs: ALPN protocol is None    
2024-06-14T23:01:39.845877Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::tls13: Got CertificateRequest CertificateRequestPayloadTLS13 { context: , extensions: [Unknown(UnknownExtension { typ: StatusRequest, payload:  }), Unknown(UnknownExtension { typ: SCT, payload:  }), SignatureAlgorithms([RSA_PSS_SHA256, ECDSA_NISTP256_SHA256, ED25519, RSA_PSS_SHA384, RSA_PSS_SHA512, RSA_PKCS1_SHA256, RSA_PKCS1_SHA384, RSA_PKCS1_SHA512, ECDSA_NISTP384_SHA384, ECDSA_NISTP521_SHA512, RSA_PKCS1_SHA1, ECDSA_SHA1_Legacy]), AuthorityNames([DistinguishedName(300d310b3009060355040313026361), DistinguishedName(3011310f300d060355040313066167672d6361)])] }    
2024-06-14T23:01:39.845932Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::common: Client auth requested but no cert/sigscheme available    
2024-06-14T23:01:39.846540Z DEBUG hyper::proto::h1::io: flushed 1956 bytes
2024-06-14T23:01:39.848708Z DEBUG hyper::proto::h1::io: parsed 7 headers
2024-06-14T23:01:39.848724Z DEBUG hyper::proto::h1::conn: incoming body is chunked encoding
2024-06-14T23:01:40.045298Z DEBUG hyper::proto::h1::conn: read eof
2024-06-14T23:01:41.039244Z DEBUG hyper::proto::h1::role: setting h1 header read timeout timer
2024-06-14T23:01:41.039253Z DEBUG hyper::proto::h1::role: setting h1 header read timeout timer
2024-06-14T23:01:41.039320Z DEBUG hyper::proto::h1::io: parsed 3 headers
2024-06-14T23:01:41.039329Z DEBUG hyper::proto::h1::conn: incoming body is empty
2024-06-14T23:01:41.039357Z DEBUG hyper::proto::h1::io: parsed 3 headers
2024-06-14T23:01:41.039372Z DEBUG hyper::proto::h1::conn: incoming body is empty
2024-06-14T23:01:41.039442Z DEBUG hyper::proto::h1::io: flushed 107 bytes
2024-06-14T23:01:41.040019Z DEBUG hyper::proto::h1::io: flushed 107 bytes
2024-06-14T23:01:41.752835Z DEBUG hyper::proto::h1::conn: incoming body decode error: Connection reset by peer (os error 104)
2024-06-14T23:01:41.752898Z DEBUG rustls::common_state: Sending warning alert CloseNotify    
2024-06-14T23:01:41.752927Z DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
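
Note the pattern: each watch request succeeds, the body streams for a moment, and then the read is reset well before the 290s server-side timeout. The one-shot curl above completes fine; to mimic what the controller actually does, hold the watch open from inside the container and time it:

$ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
$ time curl -sS -N --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    -H "Authorization: Bearer $TOKEN" \
    "https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?watch=true&timeoutSeconds=290&allowWatchBookmarks=true"

If something on the path resets long-lived connections, this should fail the same way before the ~290s timeout elapses.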

Output of linkerd check -o short:

linkerd-identity
----------------
‼ trust anchors are valid for at least 60 days
    Anchors expiring soon:
	* 329674952538096284210311201227367316359 root.linkerd.cluster.local will expire on 2024-06-19T23:29:52Z
    see https://linkerd.io/2.13/checks/#l5d-identity-trustAnchors-not-expiring-soon for hints
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2024-06-19T23:16:07Z
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.3.1 but the latest edge version is 24.6.2
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.3.1 but cli running stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-669ccf4f8-rx85b (edge-24.3.1)
	* linkerd-destination-669ccf4f8-ssqd2 (edge-24.3.1)
	* linkerd-identity-7f6f95b45c-sndb2 (edge-24.3.1)
	* linkerd-identity-7f6f95b45c-zhcck (edge-24.3.1)
	* linkerd-proxy-injector-6466cbdf79-9ftwx (edge-24.3.1)
	* linkerd-proxy-injector-6466cbdf79-n5gcs (edge-24.3.1)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-669ccf4f8-rx85b running edge-24.3.1 but cli running stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.13/checks/#l5d-injection-disabled for hints

linkerd-viz
-----------
‼ viz extension proxies are healthy
    Some pods do not have the current trust bundle and must be restarted:
	* metrics-api-5f994ffcd4-gt62w
	* prometheus-54f9848dd8-pllbj
	* tap-c745745f5-bptpg
	* tap-injector-5b5898c87-j5tl7
	* web-7c948c6b48-hzsz4
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-healthy for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-5f994ffcd4-gt62w (edge-24.3.1)
	* prometheus-54f9848dd8-pllbj (edge-24.3.1)
	* tap-c745745f5-bptpg (edge-24.3.1)
	* tap-injector-5b5898c87-j5tl7 (edge-24.3.1)
	* web-7c948c6b48-hzsz4 (edge-24.3.1)
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-5f994ffcd4-gt62w running edge-24.3.1 but cli running stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cli-version for hints

Environment

  • Cilium version: 1.13.17
  • Linkerd version: edge-24.3.1
  • Kubernetes version: 1.28.3
  • Cloud environment: Azure (AKS) with BYO network plugin

Possible solution

No response

Additional context

I modified the policy controller container image by adding ls, wget, sh, and curl so I could debug from inside the container.
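
Roughly how the debug image was built, a sketch rather than the exact Dockerfile (the policy-controller binary path inside the upstream image is an assumption; verify it before use):

$ docker build -t policy-controller-debug - <<'EOF'
FROM cr.l5d.io/linkerd/policy-controller:edge-24.3.1 AS upstream
# alpine's busybox already provides ls, sh and wget; add curl on top
FROM alpine:3.19
RUN apk add --no-cache curl
# NOTE: assumed binary path; inspect the upstream image to confirm
COPY --from=upstream /bin/policy-controller /bin/policy-controller
ENTRYPOINT ["/bin/policy-controller"]
EOF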

Would you like to work on fixing this bug?

None

xolott · Jun 19 '24 22:06