kube-state-metrics
kube-state-metrics - write: broken pipe
What happened: After noticing some "anomalies" in some metrics, I did a little Googling and checked the kube-state-metrics pod log. I found numerous messages like the following:
E1205 05:04:35.900320 1 metrics_handler.go:223] "Failed to write EOF directive" err="write tcp 10.244.42.152:8080->10.244.101.147:35432: write: broken pipe"
E1205 05:05:37.269362 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.244.42.152:8080->10.244.101.147:40672: write: broken pipe"
E1205 05:05:37.269391 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.244.42.152:8080->10.244.101.147:40672: write: broken pipe"
E1205 05:05:37.269401 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.244.42.152:8080->10.244.101.147:40672: write: broken pipe"
E1205 05:05:37.269410 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.244.42.152:8080->10.244.101.147:40672: write: broken pipe"
E1205 05:05:37.269418 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.244.42.152:8080->10.244.101.147:40672: write: broken pipe"
E1205 05:05:37.269425 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.244.42.152:8080->10.244.101.147:40672: write: broken pipe"
E1205 05:05:37.269434 1 metrics_handler.go:223] "Failed to write EOF directive" err="write tcp 10.244.42.152:8080->10.244.101.147:40672: write: broken pipe"
What you expected to happen: Not really sure, but I am sure these error messages are not good.
How to reproduce it (as minimally and precisely as possible):
I am not sure how to reproduce this, as I don't know what causes it to occur. I found it in the pod logs, then decided to restart the pod. Unfortunately this blew away the old logs and it has yet to appear in the new pod logs.
Anything else we need to know?:
Environment:
- kube-state-metrics version:
I used the kube-prometheus-stack Helm chart (https://github.com/prometheus-community/helm-charts/) to install Prometheus, Grafana, etc. to our Kubernetes cluster. The Helm chart version is 52.1.0. After inspecting the Helm chart dependencies, I believe kube-state-metrics is version 5.14.0.
I did a quick inspection of the pod spec and found the following:
containers:
  - name: kube-state-metrics
    image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
    args:
      - '--port=8080'
      - >-
        --resources=certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.6", GitCommit:"741c8db18a52787d734cbe4795f0b4ad860906d6", GitTreeState:"clean", BuildDate:"2023-09-13T09:14:09Z", GoVersion:"go1.20.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.22) and server (1.27) exceeds the supported minor version skew of +/-1
- Cloud provider or hardware configuration:
Our K8s cluster is an on-premises installation. The cluster nodes are running Ubuntu 20.04.
- Other info:
/triage accepted
/assign @dgrisonnet
@logicalhan @dgrisonnet any estimate on when you plan to fix this? I'm facing the same issue with the same version.
E0202 22:02:08.428816 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:41082: write: broken pipe"
E0205 04:56:32.728805 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728835 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728857 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728870 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728888 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928694 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928740 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928757 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0207 09:31:40.528764 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528815 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528829 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528848 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528861 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528875 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528894 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
I don't have any bandwidth to investigate right now.
@CatherineF-dev do you perhaps have some time to take a look at this?
Ok
@bengoldenberg09
- Could you help provide detailed steps on reproducing this issue? Thx
- Does latest KSM still have this issue?
This is my deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitor
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
      annotations:
        karpenter.sh/do-not-evict: 'true'
        prometheus.io/path: /metrics
        prometheus.io/port: '8080'
        prometheus.io/probe: 'true'
        prometheus.io/scrape: 'true'
    spec:
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
          ports:
            - name: http-metrics
              containerPort: 8080
              protocol: TCP
            - name: telemetry
              containerPort: 8081
              protocol: TCP
          resources:
            limits:
              cpu: 10m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: false
            runAsUser: 65534
            runAsNonRoot: false
            readOnlyRootFilesystem: false
            allowPrivilegeEscalation: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
      serviceAccount: kube-state-metrics
      automountServiceAccountToken: true
      shareProcessNamespace: false
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      enableServiceLinks: true
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
So I have the latest one, and it is happening from time to time.
These were my logs when I ran the deployment:
I0207 13:17:41.754839 1 server.go:72] level=info msg="Listening on" address=[::]:8081
I0207 13:17:41.854588 1 server.go:72] level=info msg="TLS is disabled." http2=false address=[::]:8081
I0207 13:17:53.456710 1 trace.go:236] Trace[169429371]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 12001ms):
Trace[169429371]: ---"Objects listed" error:<nil> 11001ms (13:17:52.456)
Trace[169429371]: [12.001555192s] [12.001555192s] END
I0207 13:17:54.557245 1 trace.go:236] Trace[330206141]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.454) (total time: 13102ms):
Trace[330206141]: ---"Objects listed" error:<nil> 13102ms (13:17:54.557)
Trace[330206141]: [13.102365236s] [13.102365236s] END
I0207 13:18:11.657805 1 trace.go:236] Trace[1483466474]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 30202ms):
Trace[1483466474]: ---"Objects listed" error:<nil> 30200ms (13:18:11.655)
Trace[1483466474]: [30.202597991s] [30.202597991s] END
I0207 13:18:13.356858 1 trace.go:236] Trace[1575017887]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.454) (total time: 31901ms):
Trace[1575017887]: ---"Objects listed" error:<nil> 31900ms (13:18:13.355)
Trace[1575017887]: [31.901926607s] [31.901926607s] END
I0207 13:18:16.554560 1 trace.go:236] Trace[1245028553]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 35099ms):
Trace[1245028553]: ---"Objects listed" error:<nil> 18700ms (13:18:00.155)
Trace[1245028553]: ---"SyncWith done" 16398ms (13:18:16.554)
Trace[1245028553]: [35.099403448s] [35.099403448s] END
I0207 13:18:17.655904 1 trace.go:236] Trace[1419539666]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 36200ms):
Trace[1419539666]: ---"Objects listed" error:<nil> 33899ms (13:18:15.355)
Trace[1419539666]: ---"SyncWith done" 2300ms (13:18:17.655)
Trace[1419539666]: [36.200643647s] [36.200643647s] END
E0207 15:01:13.464963 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.34.120:8080->10.11.20.121:52334: write: broken pipe"
E0207 15:01:13.465072 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:52334: write: broken pipe"
E0207 20:02:35.555637 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557513 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557550 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557563 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557582 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557683 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557698 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
I am using VictoriaMetrics for monitoring.
We're also using VictoriaMetrics and see the same issue. Scraping component is "VMAgent".
image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
We always see this when the cluster is heavily loaded, with a bazillion Pods running and other churn.
We use k-s-m metrics to identify nodes in a JOIN, and the metrics always drop out when the churn rises. So it must be related to the "amount of stuff happening", for lack of a more cogent description.
See drop-outs here: [screenshot]
Correlated error logs: [screenshot]
Facing the same issue, current KSM version: v2.8.2
Logs:
E0219 03:57:11.745859 1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
E0219 03:57:11.745869 1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
E0219 03:57:11.745878 1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
KSM version: v2.10.0
I have been observing the same when there are a lot of pods in the cluster (over 5K). It works properly when the number of pods is lower (under 500).
qq: could anyone help provide detailed steps to reproduce this issue? Thx!
cc @bengoldenberg09, could you paste the deployment YAML again? The code formatting above got messed up.
cc @naweeng, do you remember how to reproduce?
This error is thrown from https://github.com/kubernetes/kube-state-metrics/blob/v2.8.2/pkg/metricshandler/metrics_handler.go#L210-L215 with the error write: broken pipe or (from Go) write: connection reset by peer:
// writer is the io.Writer wrapping the HTTP response (w http.ResponseWriter)
// for this /metrics request; each metrics writer streams its families into it,
// so a client that hangs up mid-scrape surfaces here as "write: broken pipe".
for _, w := range m.metricsWriters {
	err := w.WriteAll(writer)
	if err != nil {
		klog.ErrorS(err, "Failed to write metrics")
	}
}
The writers ultimately call Write() on that io.Writer (the HTTP response), implemented here:
- https://github.com/kubernetes/kube-state-metrics/blob/main/pkg/metrics_store/metrics_writer.go#L49-L83
I feel this might be related to the Go version. Could you try v2.9.0+, which uses Go 1.19? cc @naweeng
I feel this might be related to the Go version. Could you try v2.9.0+, which uses Go 1.19? cc @naweeng
Building with go 1.19, still the same error @CatherineF-dev
Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.
@bxy4543, could you help reproduce with detailed steps?
I'm seeing the same issue on our dev cluster (which is fairly large).
Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.
scrape_duration_seconds{cluster="cluster-name",container="kube-state-metrics",endpoint="http",instance="xxx:8080",job="kube-state-metrics",namespace="vm",pod="victoria-metrics-k8s-stack-kube-state-metrics-847ddd64-splbf",prometheus="vm/victoria-metrics-k8s-stack",service="victoria-metrics-k8s-stack-kube-state-metrics"} 1.602
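A quick way to keep an eye on this is to record the scrape duration and compare it against the scraper's timeout (Prometheus defaults to a 10s scrape_timeout; a scraper that hits its timeout closes the connection and the server sees a broken pipe). A minimal sketch as a Prometheus-style rule file; the group name, record name, and job label value are assumptions you would adjust to your setup:
# Hypothetical rule file: track how long the kube-state-metrics scrape takes.
groups:
  - name: ksm-scrape-health        # assumed group name
    rules:
      - record: job:scrape_duration_seconds:max
        expr: max(scrape_duration_seconds{job="kube-state-metrics"})  # assumed job label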
@bxy4543, could you help reproduce with detailed steps?
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
    meta.helm.sh/release-name: victoria-metrics-k8s-stack
    meta.helm.sh/release-namespace: vm
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: victoria-metrics-k8s-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.10.0
    helm.sh/chart: kube-state-metrics-5.12.1
  name: victoria-metrics-k8s-stack-kube-state-metrics
  namespace: vm
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: victoria-metrics-k8s-stack
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: victoria-metrics-k8s-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: kube-state-metrics
        app.kubernetes.io/version: 2.10.0
        helm.sh/chart: kube-state-metrics-5.12.1
    spec:
      containers:
      - args:
        - --port=8080
        - --resources=certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: victoria-metrics-k8s-stack-kube-state-metrics
      serviceAccountName: victoria-metrics-k8s-stack-kube-state-metrics
      terminationGracePeriodSeconds: 30
@CatherineF-dev
Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.
I found the cause of the problem. Since some data could still be obtained, I had not checked the target status; when I did, I found this error:
cannot read stream body in 1 seconds: the response from "http://xxx:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize
So I increased the vmagent parameter -promscrape.maxScrapeSize (the default is 16777216), and no more errors were reported.
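For anyone applying the same fix, the flag is passed as an argument to the vmagent container; a minimal sketch under the assumption of a plain vmagent Deployment (the container layout is illustrative, and 67108864 = 64 MiB is only an example value):
# vmagent Deployment fragment (illustrative); only the relevant arg is shown.
containers:
  - name: vmagent
    image: victoriametrics/vmagent            # your existing image/tag
    args:
      # ...existing vmagent args...
      - -promscrape.maxScrapeSize=67108864    # example: 64 MiB, raised from the 16777216 (16 MiB) default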
Thx!
cannot read stream body in 1 seconds: the response from "http://xxx:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize
qq: where did you find this error log? Is it from prometheus?
qq: where did you find this error log? Is it from prometheus?
I am using VictoriaMetrics; I found this on the vmagent targets page.
@jgagnon44 @decipher27 @naweeng @towolf could you try the solution above ("I increased the vmagent parameter -promscrape.maxScrapeSize (the default is 16777216), and no more errors were reported")?
If there are no issues, I will close this.
/close
@CatherineF-dev: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What's the Prometheus equivalent of the vmagent parameter promscrape.maxScrapeSize? If it is body_size_limit, its default is 0, which means no limit. I'm still having the same issue; let me know if you folks were able to implement such a solution and see results.
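For reference, body_size_limit is configured per scrape job in Prometheus; a minimal sketch (the job name and values are illustrative, and since the default is already unlimited, scrape_timeout is usually the more relevant knob for broken pipes):
scrape_configs:
  - job_name: kube-state-metrics   # illustrative job name
    scrape_interval: 60s
    scrape_timeout: 60s            # Prometheus defaults to 10s; a large KSM payload can exceed that
    body_size_limit: 0             # 0 = uncompressed body size is unlimited (the default)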
Please reopen this issue. It also occurs with Prometheus.
/reopen
@CatherineF-dev: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
We increased the CPU/memory resource limits on the kube-state-metrics pod, and we are no longer seeing this error. We also tried increasing the scrape interval, but reverted that change since it didn't have any impact.
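For context, the deployment pasted earlier in this thread caps kube-state-metrics at 10m CPU / 100Mi memory, which is very tight for a large cluster. A minimal sketch of a more generous allocation; the numbers are illustrative assumptions, not a recommendation:
# kube-state-metrics container fragment; size requests/limits to your cluster's object count.
resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    memory: 1Gi     # leaving the CPU limit unset avoids throttling during large scrapes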