opentelemetry-collector
scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage"
Describe the bug: The OTel Collector can't scrape pod metrics.
Steps to reproduce: Configure the prometheus exporter with a Prometheus endpoint.
What did you expect to see? Metrics scraped from pods and forwarded to the prometheusremotewrite exporter.
What did you see instead? This error:
error scrape/scrape.go:1313 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "otel_kubernetes_podscraper", "target": "http://ip:port/metrics", "error": "data refused due to high memory usage"}
What version did you use? Version: 0.82
What config did you use? Config:
...
  prometheus:
    endpoint: 0.0.0.0:port
    metric_expiration: 120m
    resource_to_telemetry_conversion:
      enabled: true
    send_timestamps: true
  prometheusremotewrite:
    endpoint: http://hostname/prometheus/api/v1/write
extensions:
  health_check: {}
  memory_ballast: {}
processors:
  batch: {}
  memory_limiter:
    check_interval: 3s
    limit_mib: 6553
    spike_limit_mib: 2048
...
Environment: Kubernetes Pod (deployed from the Helm chart) with limits of 8 GiB memory and 2 CPU cores.
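One thing worth double-checking in the config above (a sizing sketch, not a confirmed fix): limit_mib: 6553 is an absolute value, while the memory_limiter also accepts percentage-based limits that track the container's cgroup limit. For an 8 GiB pod that might look like the following; the 80/25 split and 1s check interval are assumptions on my part:

processors:
  memory_limiter:
    check_interval: 1s
    # hard limit at ~80% of the 8 GiB container limit (~6.4 GiB)
    limit_percentage: 80
    # soft limit at 80% - 25% = 55% (~4.4 GiB); data is refused once usage crosses the soft limit
    spike_limit_percentage: 25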
I'm having the same problem: the prometheus receiver's scrapes fail from the moment the container starts. I have tried different collector versions and settings:
resources:
  requests:
    cpu: 500m
    memory: 4096Mi
  limits:
    memory: 4096Mi
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
  batch:
    send_batch_size: 1000
    timeout: 1s
    send_batch_max_size: 1500
extensions:
  memory_ballast:
    size_mib: 2000
I have these settings:
resources:
  requests:
    cpu: 2
    memory: 4096Mi
  limits:
    memory: 8192Mi
processors:
  memory_limiter:
    check_interval: 3s
    limit_mib: 6553
    spike_limit_mib: 2048
I should try memory_ballast as well ...
Does anyone have normal scrape results from pods under load?
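For anyone who wants to try the memory_ballast route mentioned above: the extension has to be both configured and listed under service.extensions to take effect. A minimal sketch follows; the size is an assumption (the extension's README suggests roughly 1/3 to 1/2 of total memory), and note that newer collector releases deprecate this extension in favor of the GOMEMLIMIT environment variable:

extensions:
  memory_ballast:
    # assumed value; guidance is roughly 1/3 to 1/2 of the container memory limit
    size_mib: 2000
service:
  extensions: [memory_ballast]   # append to any extensions you already list here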
Added definitions for the batch processor:
batch:
  send_batch_size: 1000
  timeout: 1s
  send_batch_max_size: 1500
Let's see how it goes.
At first it worked fine. And then the problems started again:
After a few days of running (five), history repeats itself.
Hey @pilot513, did you manage to fix this issue somehow? I am also facing the same issue with histogram metrics export. Please suggest a solution if you managed to find one.
Thanks
In my case, I noticed that the number of metrics was constantly growing. I began to study the issue and discovered that one application was generating a constant stream of new unique metrics, which shouldn't happen. I pointed this out to the developers, and they fixed it, since their code for exposing metrics was incorrect. As soon as I reinstalled the application, the problem went away.
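If the offending application can't be fixed quickly, a possible stopgap (a sketch assuming the runaway series come from a single high-cardinality label, hypothetically named request_id here) is to drop that label with the attributes processor before the data reaches the exporter:

processors:
  attributes/drop-high-cardinality:
    actions:
      # "request_id" is a hypothetical label name; substitute whichever label keeps producing new series
      - key: request_id
        action: delete

The processor also has to be added to the metrics pipeline's processors list, after memory_limiter, for it to run.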
I'm seeing the same thing: memory usage keeps going up until the receiver starts failing. At that point I begin to see export failures, and the export queue grows as well. We're sending around 35K data points per second across 350 scrape targets.
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector error scrape/scrape.go:1351 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "k8s", "target": "http://10.30.28.208:6666/metrics", "error": "data refused due to high memory usage"}
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
github.com/prometheus/[email protected]/scrape/scrape.go:1351
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
github.com/prometheus/[email protected]/scrape/scrape.go:1429
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/[email protected]/scrape/scrape.go:1306
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: k8s
          tls_config:
            insecure_skip_verify: true
          scrape_interval: 10s
          ...
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  k8sattributes:
    extract:
      metadata:
        - k8s.container.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.replicaset.name
        - k8s.node.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.statefulset.name
      labels:
        - tag_name: k8s.pod.label.app
          key: app
          from: pod
        - tag_name: k8s.pod.label.component
          key: component
          from: pod
        - tag_name: k8s.pod.label.zone
          key: zone
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  transform/add-workload-label:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["kube_workload_name"], resource.attributes["k8s.deployment.name"])
          - set(attributes["kube_workload_name"], resource.attributes["k8s.statefulset.name"])
          - set(attributes["kube_workload_type"], "deployment") where resource.attributes["k8s.deployment.name"] != nil
          - set(attributes["kube_workload_type"], "statefulset") where resource.attributes["k8s.statefulset.name"] != nil
exporters:
  prometheusremotewrite:
    endpoint: ${env:PROMETHEUSREMOTEWRITE_ENDPOINT}
    headers:
      Authorization: ${env:PROMETHEUSREMOTEWRITE_TOKEN}
    resource_to_telemetry_conversion:
      enabled: true
    max_batch_size_bytes: 2000000
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch, k8sattributes, transform/add-workload-label]
      exporters: [prometheusremotewrite]
containers:
  - command:
      - /otelcol-contrib
      - --config=/conf/otel-collector-config.yaml
    image: otel/opentelemetry-collector-contrib:0.96.0
    imagePullPolicy: IfNotPresent
    name: otel-collector
    ports:
      - containerPort: 55679
        protocol: TCP
      - containerPort: 4317
        protocol: TCP
      - containerPort: 4318
        protocol: TCP
      - containerPort: 14250
        protocol: TCP
      - containerPort: 14268
        protocol: TCP
      - containerPort: 9411
        protocol: TCP
      - containerPort: 8888
        protocol: TCP
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
      requests:
        cpu: "4"
        memory: 16Gi
    volumeMounts:
      - mountPath: /conf
        name: otel-collector-config-vol
    env:
      - name: "GOMEMLIMIT"
        value: "12GiB" # 80% of memory request
    envFrom:
      - secretRef:
          name: otel-collector
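For what it's worth, here is how those numbers appear to line up, assuming the memory_limiter reads the 16Gi cgroup limit (my arithmetic, not something confirmed in this thread):

# resources.limits.memory = 16Gi
#   hard limit = limit_percentage (80%)                      ~= 12.8 GiB
#   soft limit = 80% - spike_limit_percentage (20%) = 60%    ~=  9.6 GiB  <- refusals start above this
#   GOMEMLIMIT = 12GiB, i.e. between the soft and hard limits, so GC gets aggressive before the hard limit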
Just tested on v0.97, same failure pattern. I noticed this error message as well:
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded", "errorCauses": [{"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 108155}
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 108341}
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
I hit a similar error on v0.97.
I am facing a similar issue. Scrapes continuously fail with the error below:
error scrape/scrape.go:1351 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "itomperftesting-otel-collector-job", "target": "http://0.0.0.0:8888/metrics", "error": "data refused due to high memory usage"}
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
github.com/prometheus/[email protected]/scrape/scrape.go:1351
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
github.com/prometheus/[email protected]/scrape/scrape.go:1429
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/[email protected]/scrape/scrape.go:1306
This also leads to high memory usage in the otel-collector.
Do we have any workaround for this?
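One workaround that is sometimes used for this kind of scrape load (not confirmed for this exact issue, and the replica count of 2 below is an assumption) is to shard the scrape targets across several collector replicas with a standard Prometheus hashmod relabel rule, so no single instance has to hold all the series in memory:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: k8s
          scrape_interval: 10s
          relabel_configs:
            # hash each target's address into one of N shards
            - source_labels: [__address__]
              action: hashmod
              modulus: 2              # total number of collector replicas (assumed: 2)
              target_label: __tmp_shard
            # keep only the targets belonging to this replica's shard
            - source_labels: [__tmp_shard]
              action: keep
              regex: "0"              # set to 0..modulus-1 per replica, e.g. via ${env:SHARD}

If you run on Kubernetes with the OpenTelemetry Operator, its Target Allocator can distribute targets across replicas automatically instead of hand-maintained hashmod rules.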