opentelemetry-collector
scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage"
Describe the bug: The OTel Collector can't scrape pod metrics.
Steps to reproduce: Configure the prometheus exporter with a Prometheus endpoint.
What did you expect to see? Metrics scraped from pods and forwarded to the prometheusremotewrite exporter.
What did you see instead? This error:
error scrape/scrape.go:1313 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "otel_kubernetes_podscraper", "target": "http://ip:port/metrics", "error": "data refused due to high memory usage"}
What version did you use? Version: 0.82
What config did you use? Config:
...
  prometheus:
    endpoint: 0.0.0.0:port
    metric_expiration: 120m
    resource_to_telemetry_conversion:
      enabled: true
    send_timestamps: true
  prometheusremotewrite:
    endpoint: http://hostname/prometheus/api/v1/write
extensions:
  health_check: {}
  memory_ballast: {}
processors:
  batch: {}
  memory_limiter:
    check_interval: 3s
    limit_mib: 6553
    spike_limit_mib: 2048
...
Environment: Kubernetes Pod (deployed from the Helm chart) with limits of 8 GiB memory and 2 CPU cores.
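One thing worth double-checking in the config above (a sizing sketch, not a confirmed fix): limit_mib: 6553 is an absolute value, while the memory_limiter also accepts percentage-based limits that track the container's cgroup limit. For an 8 GiB pod that might look like the following; the 80/25 split and 1s check interval are assumptions on my part:

processors:
  memory_limiter:
    check_interval: 1s
    # hard limit at ~80% of the 8 GiB container limit (~6.4 GiB)
    limit_percentage: 80
    # soft limit at 80% - 25% = 55% (~4.4 GiB); data is refused once usage crosses the soft limit
    spike_limit_percentage: 25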
I'm having the same problem: the prometheus receiver's scrapes fail from the moment the container starts. I have tried different collector versions and settings:
resources:
  requests:
    cpu: 500m
    memory: 4096Mi
  limits:
    memory: 4096Mi
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
  batch:
    send_batch_size: 1000
    timeout: 1s
    send_batch_max_size: 1500
extensions:
  memory_ballast:
    size_mib: 2000
I have these settings:
resources:
  requests:
    cpu: 2
    memory: 4096Mi
  limits:
    memory: 8192Mi
processors:
  memory_limiter:
    check_interval: 3s
    limit_mib: 6553
    spike_limit_mib: 2048
I should try memory_ballast as well ...
Does anyone have normal scrape results from pods under load?
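For anyone who wants to try the memory_ballast route mentioned above: the extension has to be both configured and listed under service.extensions to take effect. A minimal sketch follows; the size is an assumption (the extension's README suggests roughly 1/3 to 1/2 of total memory), and note that newer collector releases deprecate this extension in favor of the GOMEMLIMIT environment variable:

extensions:
  memory_ballast:
    # assumed value; guidance is roughly 1/3 to 1/2 of the container memory limit
    size_mib: 2000
service:
  extensions: [memory_ballast]   # append to any extensions you already list here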
Added definitions for the batch processor:
batch:
  send_batch_size: 1000
  timeout: 1s
  send_batch_max_size: 1500
Let's see how it goes.
At first it worked fine. And then the problems started again:
After a few days of running (five), history repeats itself.
Hey @pilot513, did you manage to fix this issue somehow? I am also facing the same issue with histogram metrics export. Please suggest a solution if you managed to find one.
Thanks
In my case, I noticed that the number of metrics was constantly growing. I began to study the issue and discovered that one application was generating a constant stream of new unique metrics, which shouldn't happen. I pointed this out to the developers, and they fixed it, since their code for exposing metrics was incorrect. As soon as I reinstalled the application, the problem went away.
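If the offending application can't be fixed quickly, a possible stopgap (a sketch assuming the runaway series come from a single high-cardinality label, hypothetically named request_id here) is to drop that label with the attributes processor before the data reaches the exporter:

processors:
  attributes/drop-high-cardinality:
    actions:
      # "request_id" is a hypothetical label name; substitute whichever label keeps producing new series
      - key: request_id
        action: delete

The processor also has to be added to the metrics pipeline's processors list, after memory_limiter, for it to run.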
I'm seeing the same thing: memory usage keeps going up until the receiver starts failing. At that point I begin to see export failures, and the export queue grows as well. We're sending around 35K data points per second across 350 scrape targets.
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector error scrape/scrape.go:1351 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "k8s", "target": "http://10.30.28.208:6666/metrics", "error": "data refused due to high memory usage"}
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
github.com/prometheus/[email protected]/scrape/scrape.go:1351
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
github.com/prometheus/[email protected]/scrape/scrape.go:1429
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/[email protected]/scrape/scrape.go:1306
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: k8s
          tls_config:
            insecure_skip_verify: true
          scrape_interval: 10s
          ...
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  k8sattributes:
    extract:
      metadata:
        - k8s.container.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.replicaset.name
        - k8s.node.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.statefulset.name
      labels:
        - tag_name: k8s.pod.label.app
          key: app
          from: pod
        - tag_name: k8s.pod.label.component
          key: component
          from: pod
        - tag_name: k8s.pod.label.zone
          key: zone
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  transform/add-workload-label:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["kube_workload_name"], resource.attributes["k8s.deployment.name"])
          - set(attributes["kube_workload_name"], resource.attributes["k8s.statefulset.name"])
          - set(attributes["kube_workload_type"], "deployment") where resource.attributes["k8s.deployment.name"] != nil
          - set(attributes["kube_workload_type"], "statefulset") where resource.attributes["k8s.statefulset.name"] != nil
exporters:
  prometheusremotewrite:
    endpoint: ${env:PROMETHEUSREMOTEWRITE_ENDPOINT}
    headers:
      Authorization: ${env:PROMETHEUSREMOTEWRITE_TOKEN}
    resource_to_telemetry_conversion:
      enabled: true
    max_batch_size_bytes: 2000000
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch, k8sattributes, transform/add-workload-label]
      exporters: [prometheusremotewrite]
containers:
  - command:
      - /otelcol-contrib
      - --config=/conf/otel-collector-config.yaml
    image: otel/opentelemetry-collector-contrib:0.96.0
    imagePullPolicy: IfNotPresent
    name: otel-collector
    ports:
      - containerPort: 55679
        protocol: TCP
      - containerPort: 4317
        protocol: TCP
      - containerPort: 4318
        protocol: TCP
      - containerPort: 14250
        protocol: TCP
      - containerPort: 14268
        protocol: TCP
      - containerPort: 9411
        protocol: TCP
      - containerPort: 8888
        protocol: TCP
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
      requests:
        cpu: "4"
        memory: 16Gi
    volumeMounts:
      - mountPath: /conf
        name: otel-collector-config-vol
    env:
      - name: "GOMEMLIMIT"
        value: "12GiB" # 80% of memory request
    envFrom:
      - secretRef:
          name: otel-collector
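For what it's worth, here is how those numbers appear to line up, assuming the memory_limiter reads the 16Gi cgroup limit (my arithmetic, not something confirmed in this thread):

# resources.limits.memory = 16Gi
#   hard limit = limit_percentage (80%)                      ~= 12.8 GiB
#   soft limit = 80% - spike_limit_percentage (20%) = 60%    ~=  9.6 GiB  <- refusals start above this
#   GOMEMLIMIT = 12GiB, i.e. between the soft and hard limits, so GC gets aggressive before the hard limit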
Just tested on v0.97, same failure pattern. I noticed this error message as well:
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded", "errorCauses": [{"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 108155}
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 108341}
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
I hit a similar error on v0.97.
I am facing a similar issue. Scrapes continuously fail with the error below:
error scrape/scrape.go:1351 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "itomperftesting-otel-collector-job", "target": "http://0.0.0.0:8888/metrics", "error": "data refused due to high memory usage"}
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
github.com/prometheus/[email protected]/scrape/scrape.go:1351
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
github.com/prometheus/[email protected]/scrape/scrape.go:1429
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/[email protected]/scrape/scrape.go:1306
This also leads to high memory usage in the otel-collector.
Do we have any workaround for this?
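One workaround that is sometimes used for this kind of scrape load (not confirmed for this exact issue, and the replica count of 2 below is an assumption) is to shard the scrape targets across several collector replicas with a standard Prometheus hashmod relabel rule, so no single instance has to hold all the series in memory:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: k8s
          scrape_interval: 10s
          relabel_configs:
            # hash each target's address into one of N shards
            - source_labels: [__address__]
              action: hashmod
              modulus: 2              # total number of collector replicas (assumed: 2)
              target_label: __tmp_shard
            # keep only the targets belonging to this replica's shard
            - source_labels: [__tmp_shard]
              action: keep
              regex: "0"              # set to 0..modulus-1 per replica, e.g. via ${env:SHARD}

If you run on Kubernetes with the OpenTelemetry Operator, its Target Allocator can distribute targets across replicas automatically instead of hand-maintained hashmod rules.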