High CPU observed in linkerd-proxy
What is the issue?
We see high CPU usage in linkerd-proxy when scheduling new workloads, and it is affecting the actual workloads.
How can it be reproduced?
- Deploy a Kubernetes 1.22.2 cluster with 150 nodes
- Deploy Linkerd 2.11.0 with the Helm chart
- Create workloads and observe the CPU usage of the linkerd-proxy containers
Logs, error output, etc
https://gist.github.com/rajdhandus/0e9c6b99e35659a62fb1b7ee1add9e8b
output of linkerd check -o short
We are not able to run the linkerd CLI in this restricted environment.
Control Plane Setup
Data Plane showing Proxies consuming high CPU
Environment
Kubernetes 1.22.2, Linkerd stable-2.11.0, Linux
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
> How can it be reproduced?
> - Deploy a Kubernetes 1.22.2 cluster with 150 nodes
> - Deploy Linkerd 2.11.0 with the Helm chart
> - Create workloads and observe the CPU usage of the linkerd-proxy containers
This isn't enough for us to reproduce this issue. In my test cluster, for instance:
:; k top po --containers -A --sort-by=cpu -l linkerd.io/control-plane-ns
NAMESPACE   POD                                       NAME             CPU(cores)   MEMORY(bytes)
emojivoto   web-5f86686c4d-b26t6                      linkerd-proxy    19m          15Mi
emojivoto   web-5f86686c4d-b26t6                      web-svc          11m          26Mi
linkerd     linkerd-destination-78c45f45bf-k47p5      linkerd-proxy    25m          13Mi
linkerd     linkerd-destination-78c45f45bf-k47p5      destination      1m           36Mi
linkerd     linkerd-destination-78c45f45bf-k47p5      policy           1m           78Mi
linkerd     linkerd-destination-78c45f45bf-k47p5      sp-validator     1m           25Mi
emojivoto   emoji-696d9d8f95-qllpj                    linkerd-proxy    14m          11Mi
emojivoto   emoji-696d9d8f95-qllpj                    emoji-svc        4m           24Mi
emojivoto   voting-ff4c54b8d-mjvb5                    linkerd-proxy    12m          11Mi
emojivoto   voting-ff4c54b8d-mjvb5                    voting-svc       3m           27Mi
emojivoto   vote-bot-6d7677bb68-ngtmz                 linkerd-proxy    11m          14Mi
emojivoto   vote-bot-6d7677bb68-ngtmz                 vote-bot         4m           21Mi
linkerd     linkerd-identity-76cbc79c95-hfcnl         identity         1m           28Mi
linkerd     linkerd-identity-76cbc79c95-hfcnl         linkerd-proxy    1m           11Mi
linkerd     linkerd-proxy-injector-5c8b74d7f9-4mgq2   proxy-injector   1m           33Mi
linkerd     linkerd-proxy-injector-5c8b74d7f9-4mgq2   linkerd-proxy    1m           14Mi
For what it's worth, the Linkerd project doesn't have access to 150 node clusters. I'm doubtful that the number of nodes is particularly relevant to this, though.
It may be useful to see additional metrics about the workload via linkerd diagnostics proxy-metrics <pod>, but it will probably take profiling to understand what sort of issue you're encountering. If you can put together a standalone reproduction that we can run, we may be able to profile that.
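If the linkerd CLI isn't available in your environment, the same data can usually be pulled straight from the proxy's admin endpoint. A rough sketch, assuming kubectl access to the pod and the proxy's default admin port, 4191 (adjust the placeholder names as needed):

# Sketch: scrape the proxy's Prometheus metrics without the linkerd CLI.
# Assumes the default proxy admin port (4191) and kubectl access to the pod.
kubectl -n <namespace> port-forward <pod> 4191:4191 &
curl -s http://localhost:4191/metrics > proxy-metrics.txt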
Thanks for your response. We don't have access to the linkerd CLI in the restricted environment; however, I was able to get the metrics from the proxy. I see some high response times for certain outbound calls. Can you please look and confirm that I am reading it correctly?
https://gist.github.com/rajdhandus/0e9c6b99e35659a62fb1b7ee1add9e8b
For example, I see:
tcp_read_bytes_total{direction="outbound",peer="dst",authority="kubernetes.default.svc.cluster.local:443",target_addr="52.191.88.164:443",target_ip="52.191.88.164",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery",dst_namespace="default",dst_service="kubernetes"} 677700
> I see some high response times for certain outbound calls. Can you please look and confirm that I am reading it correctly?
> For example:
> tcp_read_bytes_total{direction="outbound",peer="dst",authority="kubernetes.default.svc.cluster.local:443",target_addr="52.191.88.164:443",target_ip="52.191.88.164",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery",dst_namespace="default",dst_service="kubernetes"} 677700
This metric, tcp_read_bytes_total, indicates that this proxy has read ~661KB of data from the Kubernetes API. It doesn't indicate anything about response times.
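(The counter is cumulative over the lifetime of the proxy, so 677700 bytes is roughly 661 KiB in total, not a per-request figure.)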
Looking more broadly at the TCP read and write volumes:
tcp_read_bytes_total{direction="outbound",peer="src",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="loopback"} 912
tcp_read_bytes_total{direction="inbound",peer="src",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated"} 8853
tcp_read_bytes_total{direction="outbound",peer="dst",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default"} 1444
tcp_read_bytes_total{direction="outbound",peer="dst",authority="ccf-datastore.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.160.102:16383",target_ip="10.240.160.102",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-datastore",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-datastore-cbb58649b-mwgv2",dst_pod_template_hash="cbb58649b",dst_service="ccf-datastore",dst_serviceaccount="default"} 127378
tcp_read_bytes_total{direction="outbound",peer="dst",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 7162
tcp_read_bytes_total{direction="outbound",peer="src",target_addr="10.0.223.127:80",target_ip="10.0.223.127",target_port="80",tls="no_identity",no_tls_reason="loopback"} 1569
tcp_read_bytes_total{direction="outbound",peer="src",target_addr="10.0.225.228:80",target_ip="10.0.225.228",target_port="80",tls="no_identity",no_tls_reason="loopback"} 906
tcp_write_bytes_total{direction="outbound",peer="src",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="loopback"} 4224
tcp_write_bytes_total{direction="inbound",peer="src",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated"} 9414
tcp_write_bytes_total{direction="outbound",peer="dst",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default"} 239
tcp_write_bytes_total{direction="outbound",peer="dst",authority="ccf-datastore.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.160.102:16383",target_ip="10.240.160.102",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-datastore",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-datastore-cbb58649b-mwgv2",dst_pod_template_hash="cbb58649b",dst_service="ccf-datastore",dst_serviceaccount="default"} 901
tcp_write_bytes_total{direction="outbound",peer="dst",target_addr="52.234.209.94:443",target_ip="52.234.209.94",target_port="443",tls="no_identity",no_tls_reason="not_provided_by_service_discovery"} 912
tcp_write_bytes_total{direction="outbound",peer="src",target_addr="10.0.223.127:80",target_ip="10.0.223.127",target_port="80",tls="no_identity",no_tls_reason="loopback"} 128289
tcp_write_bytes_total{direction="outbound",peer="src",target_addr="10.0.225.228:80",target_ip="10.0.225.228",target_port="80",tls="no_identity",no_tls_reason="loopback"} 2406
None of these indicate a particularly high volume of data being transferred. The most active connection has only transmitted <200KB.
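(For reference, the largest counters above, 127378 and 128289 bytes, each work out to roughly 125 KiB.)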
Response latencies are represented as bucketed histograms like:
response_latency_ms_bucket{direction="outbound",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default",status_code="200",le="200"} 0
response_latency_ms_bucket{direction="outbound",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default",status_code="200",le="300"} 1
response_latency_ms_bucket{direction="outbound",authority="ccf-governor.00000000-0000-0000-0000-fad312e3c569.svc.cluster.local",target_addr="10.240.156.152:16383",target_ip="10.240.156.152",target_port="16383",tls="true",server_id="default.00000000-0000-0000-0000-fad312e3c569.serviceaccount.identity.linkerd.cluster.local",dst_control_plane_ns="linkerd",dst_deployment="ccf-governor",dst_namespace="00000000-0000-0000-0000-fad312e3c569",dst_pod="ccf-governor-5b5c8f6bb7-j9glj",dst_pod_template_hash="5b5c8f6bb7",dst_service="ccf-governor",dst_serviceaccount="default",status_code="200",le="+Inf"} 1
This indicates that there's been exactly one response from the ccf-governor service and its response latency was between 200 and 300 milliseconds.
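(The buckets are cumulative: the le="200" bucket counts responses that took at most 200 ms, so the count falling in the 200-300 ms range is the difference between the two buckets, 1 - 0 = 1.)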
But the proxy's latency and its CPU usage are probably not strongly correlated: high CPU usage may impact latency, but high latency generally cannot cause CPU usage to increase. And we're trying to figure out where the proxy is spending its time on the CPU.
Another useful metric may be:
process_cpu_seconds_total 264.48
This indicates that the proxy has used <5 minutes of CPU time since it started; this number isn't really useful without watching it as a rate over time, though.
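If you can sample the endpoint twice, you can turn that counter into a rate yourself. A rough sketch, assuming the metrics endpoint is reachable on localhost:4191 as in the earlier port-forward example:

# Sketch: sample process_cpu_seconds_total twice, 60 seconds apart, and compute
# the average number of CPU cores the proxy used over that interval.
v1=$(curl -s http://localhost:4191/metrics | awk '/^process_cpu_seconds_total/ {print $2}')
sleep 60
v2=$(curl -s http://localhost:4191/metrics | awk '/^process_cpu_seconds_total/ {print $2}')
awk -v a="$v1" -v b="$v2" 'BEGIN {printf "average cores used: %.3f\n", (b - a) / 60}'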
There's nothing else in the metrics you shared to indicate where the proxy is spending time, unfortunately.
The last thing you could try would be to upgrade to the latest stable version, stable-2.11.2. I'm not aware of any CPU-usage-related issues we've addressed recently, though, so I'm not confident that this will resolve your issue.
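For reference, a minimal sketch of what that upgrade might look like, assuming the release was installed as linkerd2 from the stable Helm repo and that your previous values (identity certificates, etc.) can be reused:

# Sketch: upgrade the Linkerd Helm release to the latest stable chart.
# Assumes the release name is "linkerd2" and the chart came from the stable repo.
helm repo update
helm upgrade linkerd2 linkerd/linkerd2 --reuse-values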
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.