linkerd2

High CPU observed in linkerd-proxy

rajdhandus opened this issue 2 years ago (Open) • 4 comments

What is the issue?

We see high CPU usage in linkerd-proxy when new workloads are scheduled, and it is affecting the actual workloads.

How can it be reproduced?

  1. Deploy a Kubernetes 1.22.2 cluster with 150 nodes
  2. Deploy Linkerd 2.11.0 with the Helm chart
  3. Create workloads and observe the CPU usage of the Linkerd proxies

Logs, error output, etc

https://gist.github.com/rajdhandus/0e9c6b99e35659a62fb1b7ee1add9e8b

output of linkerd check -o short

Not able to run linkerd check in the restricted environment.

Control Plane Setup

(screenshot)

Data Plane showing Proxies consuming high CPU

(screenshot)

Environment

Kubernetes 1.22.2, Linkerd stable-2.11.0, Linux

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

rajdhandus avatar May 04 '22 18:05 rajdhandus

How can it be reproduced?

  1. Deploy a Kubernetes 1.22.2 cluster with 150 nodes
  2. Deploy Linkerd 2.11.0 with the Helm chart
  3. Create workloads and observe the CPU usage of the Linkerd proxies

This isn't enough for us to reproduce this issue. In my test cluster, for instance:

:; k top po --containers -A --sort-by=cpu  -l linkerd.io/control-plane-ns
NAMESPACE   POD                                       NAME             CPU(cores)   MEMORY(bytes)   
emojivoto   web-5f86686c4d-b26t6                      linkerd-proxy    19m          15Mi            
emojivoto   web-5f86686c4d-b26t6                      web-svc          11m          26Mi            
linkerd     linkerd-destination-78c45f45bf-k47p5      linkerd-proxy    25m          13Mi            
linkerd     linkerd-destination-78c45f45bf-k47p5      destination      1m           36Mi            
linkerd     linkerd-destination-78c45f45bf-k47p5      policy           1m           78Mi            
linkerd     linkerd-destination-78c45f45bf-k47p5      sp-validator     1m           25Mi            
emojivoto   emoji-696d9d8f95-qllpj                    linkerd-proxy    14m          11Mi            
emojivoto   emoji-696d9d8f95-qllpj                    emoji-svc        4m           24Mi            
emojivoto   voting-ff4c54b8d-mjvb5                    linkerd-proxy    12m          11Mi            
emojivoto   voting-ff4c54b8d-mjvb5                    voting-svc       3m           27Mi            
emojivoto   vote-bot-6d7677bb68-ngtmz                 linkerd-proxy    11m          14Mi            
emojivoto   vote-bot-6d7677bb68-ngtmz                 vote-bot         4m           21Mi            
linkerd     linkerd-identity-76cbc79c95-hfcnl         identity         1m           28Mi            
linkerd     linkerd-identity-76cbc79c95-hfcnl         linkerd-proxy    1m           11Mi            
linkerd     linkerd-proxy-injector-5c8b74d7f9-4mgq2   proxy-injector   1m           33Mi            
linkerd     linkerd-proxy-injector-5c8b74d7f9-4mgq2   linkerd-proxy    1m           14Mi  

For what it's worth, the Linkerd project doesn't have access to 150-node clusters. I'm doubtful that the number of nodes is particularly relevant here, though.

It may be useful to see additional metrics about the workload via linkerd diagnostics proxy-metrics <pod>, but it will probably take profiling to understand what sort of issue you're encountering. If you can put together a standalone reproduction that we can run, we may be able to profile it.
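
For reference, a minimal sketch of pulling those metrics, with placeholder namespace and pod names (if the CLI isn't usable in your environment, the same metrics are served on the proxy's admin port, 4191 by default):

# Via the CLI:
linkerd diagnostics proxy-metrics -n <namespace> po/<pod>

# Without the CLI, port-forward to the proxy's admin port and scrape it directly:
kubectl -n <namespace> port-forward <pod> 4191:4191 &
curl -s http://localhost:4191/metrics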

olix0r avatar May 04 '22 19:05 olix0r

Thanks for your response. We don't have access to the linkerd CLI in the restricted environment; however, I was able to get the metrics from the proxy. I see some high response times for certain outbound calls. Can you please take a look and confirm that I am reading it correctly?

https://gist.github.com/rajdhandus/0e9c6b99e35659a62fb1b7ee1add9e8b

For example, I see:

tcp_read_bytes_total{direction="outbound",peer="dst",authority="kubernetes.default.svc.cluster.local:443",target_addr="52.191.88.164:443",target_ip="52.191.88.164",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery",dst_namespace="default",dst_service="kubernetes"} 677700

rajdhandus avatar May 04 '22 22:05 rajdhandus

I see some high response times for certain outbound calls. Can you please take a look and confirm that I am reading it correctly?

For example:

tcp_read_bytes_total{direction="outbound",peer="dst",authority="kubernetes.default.svc.cluster.local:443",target_addr="52.191.88.164:443",target_ip="52.191.88.164",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery",dst_namespace="default",dst_service="kubernetes"} 677700

This metric, tcp_read_bytes_total, indicates this proxy has read ~661KB of data from the Kubernetes API. It doesn't indicate anything about response times.

Looking more broadly at the TCP read and write volumes:

tcp_read_bytes_total{direction="outbound",peer="src",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="loopback"} 912
tcp_read_bytes_total{direction="inbound",peer="src",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated"} 8853
tcp_read_bytes_total{direction="outbound",peer="dst",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default"} 1444
tcp_read_bytes_total{direction="outbound",peer="dst",authority="ccf-datastore.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.160.102:16383",target_ip="10.240.160.102",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-datastore",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-datastore-cbb58649b-mwgv2",dst_pod_template_hash="cbb58649b",dst_service="ccf-datastore",dst_serviceaccount="default"} 127378
tcp_read_bytes_total{direction="outbound",peer="dst",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 7162
tcp_read_bytes_total{direction="outbound",peer="src",target_addr="10.0.223.127:80",target_ip="10.0.223.127",target_port="80",tls="no_identity",no_tls_reason="loopback"} 1569
tcp_read_bytes_total{direction="outbound",peer="src",target_addr="10.0.225.228:80",target_ip="10.0.225.228",target_port="80",tls="no_identity",no_tls_reason="loopback"} 906
tcp_write_bytes_total{direction="outbound",peer="src",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="loopback"} 4224
tcp_write_bytes_total{direction="inbound",peer="src",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated"} 9414
tcp_write_bytes_total{direction="outbound",peer="dst",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default"} 239
tcp_write_bytes_total{direction="outbound",peer="dst",authority="ccf-datastore.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.160.102:16383",target_ip="10.240.160.102",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-datastore",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-datastore-cbb58649b-mwgv2",dst_pod_template_hash="cbb58649b",dst_service="ccf-datastore",dst_serviceaccount="default"} 901
tcp_write_bytes_total{direction="outbound",peer="dst",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 912
tcp_write_bytes_total{direction="outbound",peer="src",target_addr="10.0.223.127:80",target_ip="10.0.223.127",target_port="80",tls="no_identity",no_tls_reason="loopback"} 128289
tcp_write_bytes_total{direction="outbound",peer="src",target_addr="10.0.225.228:80",target_ip="10.0.225.228",target_port="80",tls="no_identity",no_tls_reason="loopback"} 2406

None of these indicate a particularly high volume of data being transferred. The most active connection has only transmitted <200KB.
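
As a quick sanity check on that, you can rank the counters in the dump yourself. A sketch, assuming the gist contents are saved locally under the hypothetical name proxy-metrics.txt:

# List the TCP read/write counters with the largest byte counts first
grep -E '^tcp_(read|write)_bytes_total' proxy-metrics.txt | sort -k2 -rn | head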

Response latencies are represented as bucketed histograms like:

response_latency_ms_bucket{direction="outbound",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default",status_code="200",le="200"} 0
response_latency_ms_bucket{direction="outbound",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default",status_code="200",le="300"} 1
response_latency_ms_bucket{direction="outbound",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default",status_code="200",le="+Inf"} 1

This indicates that there's been exactly one response from the ccf-governor service and its response latency was between 200 and 300 milliseconds.
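
To eyeball the full latency distribution for one destination straight from the dump, something like the following works (again assuming the hypothetical proxy-metrics.txt file; adjust the dst_service label as needed):

# Print the cumulative count per latency bucket for a single destination service
grep 'response_latency_ms_bucket' proxy-metrics.txt \
  | grep 'dst_service="ccf-governor"' \
  | sed -E 's/.*le="([^"]+)"[^ ]* (.*)$/le=\1 -> \2/'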

But the proxy's latency and its CPU usage are probably not directly correlated: high CPU usage may increase latency, but high latency generally cannot cause CPU usage to increase. And we're trying to figure out where the proxy is spending its time on the CPU.

Beyond the TCP-level data, another useful metric may be:

process_cpu_seconds_total 264.48

This indicates that the proxy has used <5 minutes of CPU time since it started; this data is not really useful without watching this as a rate over time, though.
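
A crude way to get that rate without Prometheus is to sample the counter twice. A sketch, assuming the admin port is reachable on localhost:4191 via a port-forward like the one above:

# Sample the proxy's cumulative CPU counter 60 seconds apart and print the average cores used
A=$(curl -s http://localhost:4191/metrics | awk '/^process_cpu_seconds_total/ {print $2}')
sleep 60
B=$(curl -s http://localhost:4191/metrics | awk '/^process_cpu_seconds_total/ {print $2}')
echo "average CPU over the last 60s: $(echo "($B - $A) / 60" | bc -l) cores"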

There's nothing else in the metrics you shared to indicate where the proxy is spending time, unfortunately.

The last thing you can try would be to upgrade to the latest stable version, stable-2.11.2. I'm not aware of any CPU-usage-related issues we've addressed recently, though, so I'm not confident that this will resolve your issue.
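
For reference, the upgrade via Helm would look roughly like this. A sketch, assuming the chart was installed as a release named linkerd2 from the official linkerd Helm repo and that the chart version matches the stable release number; reuse the same values (including your identity certificates) that the original install used:

# Upgrade the control-plane chart in place, keeping the values from the original install
helm repo update
helm upgrade linkerd2 linkerd/linkerd2 \
  --namespace linkerd \
  --version 2.11.2 \
  --reuse-values

Note that the data-plane proxies only pick up a new proxy version after the injected workloads are restarted (for example, with kubectl rollout restart).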

olix0r avatar May 05 '22 17:05 olix0r

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 03 '22 21:08 stale[bot]