Policy Controller Memory Leak
What is the issue?
We recently upgraded Linkerd in our staging environment from 2.12.4 (Helm chart version 1.9.6) to 2.14.8 (Helm chart version 1.16.9). After this upgrade, we noticed some cluster instability; upon investigating, it turns out that the policy container in just one of our five destination pod replicas was leaking memory at a constant rate of around 25 MB per minute. We have since updated to the latest patch version (2.14.9/1.16.10) with the same result.
The primary issue here is the memory leak itself: you can see from the chart below that the memory usage on the policy container holding the policy-controller-write lease will grow until that pod is terminated for exceeding its memory limit. (We're running a five-replica HA deployment.) I've attached a small extract of trace logs from that container; it seems to be processing HttpRoute updates at very high volume - the logs are basically just this on repeat for different namespaces/services - but I don't know if that's the intended behavior.
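For reference, this is roughly how I've been identifying the current leaseholder and spot-checking per-container usage - a sketch that assumes metrics-server is available and that Linkerd is installed in kube-system as in our cluster (adjust the namespace if yours differs):

# which replica currently holds the policy-controller-write lease
$ kubectl -n kube-system get lease policy-controller-write -o jsonpath='{.spec.holderIdentity}'

# per-container CPU/memory across the destination replicas
$ kubectl -n kube-system top pod --containers | grep policy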
The secondary issue is that in our deployment of Linkerd, the policy container was not falling back to the destination pod resource requests/limits as this line in the Helm template would imply - in fact it had no requests or limits applied at all, which meant it would eventually grow to 13 GB+ and take down the node it was scheduled on. We've worked around this in the short term by applying resources specifically to the policy controller via policyController.resources (roughly as sketched below), so it now correctly kills just the one pod, but if I'm understanding the templates correctly it never should have had the issue in the first place.
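For anyone hitting the same gap, this is approximately the override we applied. The values are just what we picked for staging, the policyController.resources key layout is my reading of the chart's values.yaml (double-check it against your chart version), and the release name and kube-system namespace are specific to our setup:

$ helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
    -n kube-system --reuse-values \
    --set policyController.resources.cpu.request=100m \
    --set policyController.resources.cpu.limit=1000m \
    --set policyController.resources.memory.request=128Mi \
    --set policyController.resources.memory.limit=500Mi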
More generally, I noticed that after the upgrade the baseline CPU and memory consumption of the policy containers (including non-leaseholders) is much higher: CPU usage is at least 20x higher (although still small in the grand scheme of things, around 70 millicores for the leaseholder and 35 millicores for non-leaseholders) and memory usage is around 5x higher (around 150 MB, up from 30 MB before the update). Skimming the release notes, it does look like the policy controller is doing a lot more these days than it used to, so again - that may be expected.
I will happily pull any resource definitions or additional logs from our staging cluster as required - just let me know what you'd like to see.
Thanks!
How can it be reproduced?
Unclear; since this is unreported elsewhere, it's likely an edge case we're just unlucky enough to have hit.
Logs, error output, etc
2024-02-18T17:21:22.344432Z TRACE hyper::proto::h1::decode: decode; state=Chunked(Size, 0)
2024-02-18T17:21:22.344459Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.344500Z TRACE hyper::proto::h1::io: received 1640 bytes
2024-02-18T17:21:22.344505Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.344508Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.344510Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.344513Z TRACE hyper::proto::h1::decode: Chunk size is 1633
2024-02-18T17:21:22.344517Z DEBUG hyper::proto::h1::decode: incoming chunked header: 0x661 (1633 bytes)
2024-02-18T17:21:22.344521Z TRACE hyper::proto::h1::decode: Chunked read, remaining=1633
2024-02-18T17:21:22.344529Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Body(Chunked(BodyCr, 0)), writing: KeepAlive, keep_alive: Busy }
2024-02-18T17:21:22.344555Z TRACE httproutes.policy.linkerd.io: tokio_util::codec::framed_impl: attempting to decode a frame
2024-02-18T17:21:22.344673Z TRACE httproutes.policy.linkerd.io: tokio_util::codec::framed_impl: frame decoded from buffer
2024-02-18T17:21:22.344858Z TRACE httproutes.policy.linkerd.io: kubert::index: event=Applied(HttpRoute { metadata: ObjectMeta { annotations: Some({"meta.helm.sh/release-name": "mfgx-customer1", "meta.helm.sh/release-namespace": "customer1"}), cluster_name: None, creation_timestamp: Some(Time(2024-01-16T17:02:10Z)), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: Some(1), labels: Some({"app.kubernetes.io/managed-by": "Helm"}), managed_fields: Some([ManagedFieldsEntry { api_version: Some("policy.linkerd.io/v1beta1"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object {"f:metadata": Object {"f:annotations": Object {".": Object {}, "f:meta.helm.sh/release-name": Object {}, "f:meta.helm.sh/release-namespace": Object {}}, "f:labels": Object {".": Object {}, "f:app.kubernetes.io/managed-by": Object {}}}, "f:spec": Object {".": Object {}, "f:parentRefs": Object {}, "f:rules": Object {}}})), manager: Some("helm"), operation: Some("Update"), time: Some(Time(2024-01-16T17:02:10Z)) }, ManagedFieldsEntry { api_version: Some("policy.linkerd.io/v1beta3"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object {"f:status": Object {".": Object {}, "f:parents": Object {}}})), manager: Some("policy.linkerd.io"), operation: Some("Update"), time: Some(Time(2024-02-18T17:21:22Z)) }]), name: Some("mfgx-backend-svc-orchestration-worker-primary-route"), namespace: Some("customer1"), owner_references: None, resource_version: Some("1495930533"), self_link: None, uid: Some("51097cb5-02ce-4a45-a73b-721661b4c637") }, spec: HttpRouteSpec { inner: CommonRouteSpec { parent_refs: Some([ParentReference { group: Some("policy.linkerd.io"), kind: Some("Server"), namespace: None, name: "mfgx-backend-svc-orchestration-worker-primary", section_name: None, port: None }]) }, hostnames: None, rules: Some([HttpRouteRule { matches: Some([HttpRouteMatch { path: Some(PathPrefix { value: "/" }), headers: None, query_params: None, method: None }]), filters: None, backend_refs: None, timeouts: None }]) }, status: Some(HttpRouteStatus { inner: RouteStatus { parents: [RouteParentStatus { parent_ref: ParentReference { group: Some("policy.linkerd.io"), kind: Some("Server"), namespace: Some("customer1"), name: "mfgx-backend-svc-orchestration-worker-primary", section_name: None, port: None }, controller_name: "linkerd.io/policy-controller", conditions: [Condition { last_transition_time: Time(2024-02-18T17:09:55Z), message: "", observed_generation: None, reason: "Accepted", status: "True", type_: "Accepted" }] }] } }) })
2024-02-18T17:21:22.345021Z DEBUG httproutes.policy.linkerd.io: linkerd_policy_controller_k8s_index::outbound::index: indexing route name="mfgx-backend-svc-orchestration-worker-primary-route"
2024-02-18T17:21:22.345092Z TRACE httproutes.policy.linkerd.io: tokio_util::codec::framed_impl: attempting to decode a frame
2024-02-18T17:21:22.345108Z TRACE hyper::proto::h1::decode: decode; state=Chunked(BodyCr, 0)
2024-02-18T17:21:22.345113Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.345176Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Body(Chunked(Size, 0)), writing: KeepAlive, keep_alive: Busy }
2024-02-18T17:21:22.345199Z TRACE hyper::proto::h1::conn: Conn::read_head
2024-02-18T17:21:22.345254Z TRACE hyper::proto::h1::io: received 1980 bytes
2024-02-18T17:21:22.345265Z TRACE parse_headers: hyper::proto::h1::role: Response.parse bytes=1980
2024-02-18T17:21:22.345274Z TRACE parse_headers: hyper::proto::h1::role: Response.parse Complete(376)
2024-02-18T17:21:22.345293Z DEBUG hyper::proto::h1::io: parsed 8 headers
2024-02-18T17:21:22.345307Z DEBUG hyper::proto::h1::conn: incoming body is content-length (1604 bytes)
2024-02-18T17:21:22.345318Z TRACE hyper::proto::h1::decode: decode; state=Length(1604)
2024-02-18T17:21:22.345322Z DEBUG hyper::proto::h1::conn: incoming body completed
2024-02-18T17:21:22.345330Z TRACE hyper::proto::h1::conn: maybe_notify; read_from_io blocked
2024-02-18T17:21:22.345385Z TRACE want: signal: Want
2024-02-18T17:21:22.345396Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: Init, keep_alive: Idle }
2024-02-18T17:21:22.345405Z TRACE want: signal: Want
2024-02-18T17:21:22.345408Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: Init, keep_alive: Idle }
2024-02-18T17:21:22.345427Z TRACE status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer1/httproutes/mfgx-backend-svc-orchestration-worker-primary-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: put; add idle connection for ("https", 10.100.0.1)
2024-02-18T17:21:22.345445Z DEBUG status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer1/httproutes/mfgx-backend-svc-orchestration-worker-primary-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: pooling idle connection for ("https", 10.100.0.1)
2024-02-18T17:21:22.345569Z TRACE status::Controller: tower::buffer::service: sending request to buffer worker
2024-02-18T17:21:22.345580Z TRACE tower::buffer::worker: worker polling for next message
2024-02-18T17:21:22.345584Z TRACE tower::buffer::worker: processing new request
2024-02-18T17:21:22.345588Z TRACE status::Controller: tower::buffer::worker: resumed=false worker received request; waiting for service readiness
2024-02-18T17:21:22.345593Z DEBUG status::Controller: tower::buffer::worker: service.ready=true processing request
2024-02-18T17:21:22.345605Z TRACE status::Controller: tower::buffer::worker: returning response future
2024-02-18T17:21:22.345610Z TRACE tower::buffer::worker: worker polling for next message
2024-02-18T17:21:22.345631Z DEBUG status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer2/httproutes/mfgx-backend-svc-package-management-health-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: kube_client::client::builder: requesting
2024-02-18T17:21:22.345642Z TRACE status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer2/httproutes/mfgx-backend-svc-package-management-health-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: take? ("https", 10.100.0.1): expiration = Some(90s)
2024-02-18T17:21:22.345651Z DEBUG status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer2/httproutes/mfgx-backend-svc-package-management-health-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: reuse idle connection for ("https", 10.100.0.1)
2024-02-18T17:21:22.345671Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=PATCH, body=Some(Known(491))
2024-02-18T17:21:22.346010Z TRACE hyper::proto::h1::io: buffer.flatten self.len=1400 buf.len=491
2024-02-18T17:21:22.346070Z DEBUG hyper::proto::h1::io: flushed 1891 bytes
2024-02-18T17:21:22.346075Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2024-02-18T17:21:22.353433Z TRACE hyper::proto::h1::decode: decode; state=Chunked(Size, 0)
2024-02-18T17:21:22.353461Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.353507Z TRACE hyper::proto::h1::io: received 1656 bytes
2024-02-18T17:21:22.353514Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.353517Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.353519Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.353523Z TRACE hyper::proto::h1::decode: Chunk size is 1649
2024-02-18T17:21:22.353528Z DEBUG hyper::proto::h1::decode: incoming chunked header: 0x671 (1649 bytes)
2024-02-18T17:21:22.353532Z TRACE hyper::proto::h1::decode: Chunked read, remaining=1649
2024-02-18T17:21:22.353541Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Body(Chunked(BodyCr, 0)), writing: KeepAlive, keep_alive: Busy }
2024-02-18T17:21:22.353565Z TRACE httproutes.policy.linkerd.io: tokio_util::codec::framed_impl: attempting to decode a frame
2024-02-18T17:21:22.353574Z TRACE httproutes.policy.linkerd.io: tokio_util::codec::framed_impl: attempting to decode a frame
2024-02-18T17:21:22.353649Z TRACE httproutes.policy.linkerd.io: tokio_util::codec::framed_impl: frame decoded from buffer
2024-02-18T17:21:22.353901Z TRACE httproutes.policy.linkerd.io: kubert::index: event=Applied(HttpRoute { metadata: ObjectMeta { annotations: Some({"meta.helm.sh/release-name": "mfgx-customer2", "meta.helm.sh/release-namespace": "customer2"}), cluster_name: None, creation_timestamp: Some(Time(2023-05-11T15:21:09Z)), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: Some(1), labels: Some({"app.kubernetes.io/managed-by": "Helm"}), managed_fields: Some([ManagedFieldsEntry { api_version: Some("policy.linkerd.io/v1beta1"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object {"f:metadata": Object {"f:annotations": Object {".": Object {}, "f:meta.helm.sh/release-name": Object {}, "f:meta.helm.sh/release-namespace": Object {}}, "f:labels": Object {".": Object {}, "f:app.kubernetes.io/managed-by": Object {}}}, "f:spec": Object {".": Object {}, "f:parentRefs": Object {}, "f:rules": Object {}}})), manager: Some("Go-http-client"), operation: Some("Update"), time: Some(Time(2023-05-11T15:21:09Z)) }, ManagedFieldsEntry { api_version: Some("policy.linkerd.io/v1beta3"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object {"f:status": Object {".": Object {}, "f:parents": Object {}}})), manager: Some("policy.linkerd.io"), operation: Some("Update"), time: Some(Time(2024-02-18T17:21:22Z)) }]), name: Some("mfgx-backend-svc-package-management-health-route"), namespace: Some("customer2"), owner_references: None, resource_version: Some("1495930534"), self_link: None, uid: Some("db54b4e7-ddb3-460f-9e46-901672337e31") }, spec: HttpRouteSpec { inner: CommonRouteSpec { parent_refs: Some([ParentReference { group: Some("policy.linkerd.io"), kind: Some("Server"), namespace: None, name: "mfgx-backend-svc-package-management-primary", section_name: None, port: None }]) }, hostnames: None, rules: Some([HttpRouteRule { matches: Some([HttpRouteMatch { path: Some(PathPrefix { value: "/health" }), headers: None, query_params: None, method: Some("GET") }]), filters: None, backend_refs: None, timeouts: None }]) }, status: Some(HttpRouteStatus { inner: RouteStatus { parents: [RouteParentStatus { parent_ref: ParentReference { group: Some("policy.linkerd.io"), kind: Some("Server"), namespace: Some("customer2"), name: "mfgx-backend-svc-package-management-primary", section_name: None, port: None }, controller_name: "linkerd.io/policy-controller", conditions: [Condition { last_transition_time: Time(2024-02-18T17:09:55Z), message: "", observed_generation: None, reason: "Accepted", status: "True", type_: "Accepted" }] }] } }) })
2024-02-18T17:21:22.354067Z DEBUG httproutes.policy.linkerd.io: linkerd_policy_controller_k8s_index::outbound::index: indexing route name="mfgx-backend-svc-package-management-health-route"
2024-02-18T17:21:22.354166Z TRACE httproutes.policy.linkerd.io: tokio_util::codec::framed_impl: attempting to decode a frame
2024-02-18T17:21:22.354189Z TRACE hyper::proto::h1::decode: decode; state=Chunked(BodyCr, 0)
2024-02-18T17:21:22.354195Z TRACE hyper::proto::h1::decode: Read chunk hex size
2024-02-18T17:21:22.354211Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Body(Chunked(Size, 0)), writing: KeepAlive, keep_alive: Busy }
2024-02-18T17:21:22.354234Z TRACE hyper::proto::h1::conn: Conn::read_head
2024-02-18T17:21:22.354271Z TRACE hyper::proto::h1::io: received 1996 bytes
2024-02-18T17:21:22.354280Z TRACE parse_headers: hyper::proto::h1::role: Response.parse bytes=1996
2024-02-18T17:21:22.354290Z TRACE parse_headers: hyper::proto::h1::role: Response.parse Complete(376)
2024-02-18T17:21:22.354309Z DEBUG hyper::proto::h1::io: parsed 8 headers
2024-02-18T17:21:22.354317Z DEBUG hyper::proto::h1::conn: incoming body is content-length (1620 bytes)
2024-02-18T17:21:22.354327Z TRACE hyper::proto::h1::decode: decode; state=Length(1620)
2024-02-18T17:21:22.354333Z DEBUG hyper::proto::h1::conn: incoming body completed
2024-02-18T17:21:22.354498Z TRACE hyper::proto::h1::conn: maybe_notify; read_from_io blocked
2024-02-18T17:21:22.354548Z TRACE want: signal: Want
2024-02-18T17:21:22.354555Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: Init, keep_alive: Idle }
2024-02-18T17:21:22.354562Z TRACE want: signal: Want
2024-02-18T17:21:22.354566Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: Init, keep_alive: Idle }
2024-02-18T17:21:22.354582Z TRACE status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer2/httproutes/mfgx-backend-svc-package-management-health-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: put; add idle connection for ("https", 10.100.0.1)
2024-02-18T17:21:22.354592Z DEBUG status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer2/httproutes/mfgx-backend-svc-package-management-health-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: pooling idle connection for ("https", 10.100.0.1)
2024-02-18T17:21:22.354694Z TRACE status::Controller: tower::buffer::service: sending request to buffer worker
2024-02-18T17:21:22.354709Z TRACE tower::buffer::worker: worker polling for next message
2024-02-18T17:21:22.354713Z TRACE tower::buffer::worker: processing new request
2024-02-18T17:21:22.354718Z TRACE status::Controller: tower::buffer::worker: resumed=false worker received request; waiting for service readiness
2024-02-18T17:21:22.354723Z DEBUG status::Controller: tower::buffer::worker: service.ready=true processing request
2024-02-18T17:21:22.354733Z TRACE status::Controller: tower::buffer::worker: returning response future
2024-02-18T17:21:22.354770Z TRACE tower::buffer::worker: worker polling for next message
2024-02-18T17:21:22.354795Z DEBUG status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer3/httproutes/mfgx-backend-svc-package-management-primary-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: kube_client::client::builder: requesting
2024-02-18T17:21:22.354809Z TRACE status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer3/httproutes/mfgx-backend-svc-package-management-primary-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: take? ("https", 10.100.0.1): expiration = Some(90s)
2024-02-18T17:21:22.354817Z DEBUG status::Controller:HTTP{http.method=PATCH http.url=https://10.100.0.1/apis/policy.linkerd.io/v1beta3/namespaces/customer3/httproutes/mfgx-backend-svc-package-management-primary-route/status?&fieldManager=policy.linkerd.io otel.name="patch_status" otel.kind="client"}: hyper::client::pool: reuse idle connection for ("https", 10.100.0.1)
2024-02-18T17:21:22.354912Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=PATCH, body=Some(Known(491))
2024-02-18T17:21:22.354935Z TRACE hyper::proto::h1::io: buffer.flatten self.len=1400 buf.len=491
2024-02-18T17:21:22.354961Z DEBUG hyper::proto::h1::io: flushed 1891 bytes
2024-02-18T17:21:22.354965Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
output of linkerd check -o short
$ linkerd check -o short --linkerd-namespace kube-system
Status check results are √
Environment
Kubernetes version: 1.26
Cluster environment: EKS
Host OS: Amazon Linux (RHEL)
Linkerd version: 2.14.9
Possible solution
No response
Additional context
Memory usage graph:
Kube manifest for one of the HttpRoute resources in question:
Name: mfgx-backend-svc-transformation-health-route
Namespace: customer1
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: mfgx-customer1
meta.helm.sh/release-namespace: customer1
API Version: policy.linkerd.io/v1beta3
Kind: HTTPRoute
Metadata:
Creation Timestamp: 2023-05-11T15:17:57Z
Generation: 1
Resource Version: 1496020292
UID: a18469f8-4b5c-4bcb-b790-69ce2cbb8a19
Spec:
Parent Refs:
Group: policy.linkerd.io
Kind: Server
Name: mfgx-backend-svc-transformation-primary
Rules:
Matches:
Method: GET
Path:
Type: PathPrefix
Value: /health
Status:
Parents:
Conditions:
Last Transition Time: 2024-02-18T17:28:01Z
Message:
Reason: Accepted
Status: True
Type: Accepted
Controller Name: linkerd.io/policy-controller
Parent Ref:
Group: policy.linkerd.io
Kind: Server
Name: mfgx-backend-svc-transformation-primary
Namespace: customer1
Events: <none>
Would you like to work on fixing this bug?
None
Same problem here. Setup:
- Kubernetes 1.27.2 (OCI)
- Red Hat Enterprise Linux 8.9
- Linkerd 2.14.9
We gradually reached the 4 GiB memory resource limit on the destination and linkerd-proxy containers and are still getting frequent OOMs.
Hi @lance-menard, thanks for this great and detailed report.
The logs you provided are interesting, and it would be great to get the full log (or as much as is available) rather than just a snippet. This will allow us to see if it's the same HttpRoute resources getting processed over and over, and/or how those resources change each time they get reindexed.
It would also be hugely useful to provide a dump of the control plane metrics when the policy controller is using a high amount of memory (ideally shortly before it OOMs). These can be collected by running linkerd diagnostics control-plane-metrics. Note that this will collect metrics from all 5 replicas, which is great because it can help us compare the healthy replicas to the one which is leaking memory.
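Something along these lines should do it (I'm writing from memory, so confirm the exact subcommand with linkerd diagnostics --help for your CLI version; the namespace flag matches your kube-system install):

$ linkerd diagnostics control-plane-metrics --linkerd-namespace kube-system > controller-metrics.txt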
Happy to help if any of this is unclear!
I've raised #12131 to separately track the issue with the policy controller manifest.
A full log file (at least, as much as Kube will export) is attached - as an indication of how much churning the container is doing, it's about 4.5MB of logs over a time span of under 7 seconds.
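(For reference, the export was just a plain kubectl logs grab from the leaking policy container - sketched below with our kube-system namespace and the leaseholder pod mentioned further down; substitute whichever pod is leaking in your cluster.)

$ kubectl -n kube-system logs linkerd-destination-98958d7d7-mtmq9 -c policy > policy-controller-full-log-export.log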
I've also attached the controller metrics dump, run when the pod was nearly at max memory, and a policy diagnostics dump from one of the pods listed in the logs. At the time of the dump, the leaseholder was linkerd-destination-98958d7d7-mtmq9.
All files have been redacted to remove customer names - unfortunately the redacted customer numbers don't line up between the files. I've retained the unredacted files so I can help correlate if required.
Over the past few days we have experienced two 15-20 minute service mesh outages, seemingly during leaseholder changes, during which all meshed pods in the cluster (which is basically all of them) returned 403 errors for all traffic. As a result, we chose to downgrade our staging environment back to 2.12.4 while we investigate. Interestingly, those 403 errors also occurred during the downgrade, which hasn't happened during any previous Linkerd deployment. We do still have a development environment with the latest version installed that we can use for debugging.
controller-metrics-redacted.txt
policy-controller-full-log-export-redacted.log
policy-diagnostics.txt
Let me know if there's anything else I can do to help - thanks!