
HTTP status 409 Conflict: Prometheus to Thanos Receive metrics

Open amit-patil opened this issue 2 years ago • 18 comments

Hello All,

I have built an integration between Prometheus and Thanos Receive through the OpenTelemetry Collector. Below is the relevant part of the OpenTelemetry Collector ConfigMap.

otel-collector-config.yaml: |-
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: "otel-collector-monitoring-1"
            scrape_interval: 5s
            honor_labels: true
            metrics_path: '/federate'
            params:
              'match[]':
                - '{job=~".+"}'
            static_configs:
              - targets:
                  - 'prometheus-k8s:9090'
  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
      metric_expiration: 180m
    prometheusremotewrite:
      endpoint: http://host.minikube.internal:32200/api/v1/receive

I can see the metrics landing correctly in Thanos Receive, but the OpenTelemetry Collector continuously throws the error below. What could be the reason, and what is the fix?

2022-09-27T09:42:23.200Z error exporterhelper/queued_retry.go:183 Exporting failed. The error is not retryable. Dropping data.
{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 409 Conflict; err = %!w(): store locally for endpoint thanos-receive-default-0.thanos-receive-default.thanos.svc.cluster.local:10901: conflict\n", "dropped_items": 91}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:69

amit-patil avatar Sep 27 '22 09:09 amit-patil

Hi, I am experiencing the same situation here, but only when traffic is high. Did this ever get resolved, or are there any clues about a configuration that can fix it?

lpegoraro avatar Mar 16 '23 20:03 lpegoraro

Hey folks, have you tried exporting to Prometheus as well? Could you post your full collector configurations?

I vaguely remember there are some related issues on the exporter in the OTel repo. This could potentially be a bug where the data is not being exported correctly, or it could also be a misconfiguration (especially if you have multiple collector instances).

Related issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/11438
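
One common source of this error is several identical collector replicas remote-writing the same series, which the receiver then rejects as duplicates. A minimal sketch of one mitigation, assuming the prometheusremotewrite exporter's external_labels option (the label name, value, and endpoint below are illustrative, not taken from this thread):

exporters:
  prometheusremotewrite:
    endpoint: http://thanos-receive.example.svc:19291/api/v1/receive
    external_labels:
      # Give each collector replica its own label value so their series
      # no longer collide on the receiver; in Kubernetes this is usually
      # injected from the pod name.
      collector_replica: otel-collector-0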

matej-g avatar Mar 17 '23 11:03 matej-g

Sure. OTel Collector version: opentelemetry-collector-contrib:0.68.0, and the config YAML is below. The project I work on builds several OTel Collectors.

---
receivers:
  kafka:
    brokers:
    - kafka:9092
    topic: otlp_metrics-sink-id-222
    protocol_version: 2.0.0
extensions:
  pprof:
    endpoint: 0.0.0.0:1888
  basicauth/exporter:
    client_auth:
      username: <<user>>
      password: <<pass>>
exporters:
  prometheusremotewrite:
    endpoint: <<endpoint>>
    auth:
      authenticator: basicauth/exporter
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 50
service:
  extensions:
  - pprof
  - basicauth/exporter
  pipelines:
    metrics:
      receivers:
      - kafka
      exporters:
      - prometheusremotewrite

This is a new scenario for us; we didn't have any issues running through Mimir, but we faced this with Thanos. The load is also a bit higher than in the Mimir scenario.

Thanks @matej-g

lpegoraro avatar Mar 17 '23 17:03 lpegoraro

Hey @lpegoraro, thanks for providing that extra info.

What is your receiver setup? Are you using replication? Do you see all your data as expected or are some data points missing? (Feel free to post your receiver config as well).

It could be that in such a case the 409 is benign. It can signal that your receiver instances already have this data, and so they refuse to write it again. This could happen in a scenario with high load and replication.
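
For context, replication is configured on the receiver itself. A minimal sketch of the relevant thanos receive flags, assuming a Kubernetes StatefulSet and a hashring file (the names, paths, and version tag are assumptions):

containers:
  - name: thanos-receive
    image: quay.io/thanos/thanos:v0.34.1
    args:
      - receive
      - --tsdb.path=/var/thanos/receive
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --remote-write.address=0.0.0.0:19291
      - --receive.local-endpoint=thanos-receive-0.thanos-receive:10901
      - --receive.hashrings-file=/etc/thanos/hashring.json
      # With a replication factor of N, every incoming series is written to
      # N nodes; retried or duplicated deliveries can then surface upstream
      # as benign 409 Conflict responses.
      - --receive.replication-factor=2
      - --label=receive_cluster="example"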

matej-g avatar Mar 22 '23 09:03 matej-g

I am also facing the same issue.

kimpetertanui avatar Jun 30 '23 21:06 kimpetertanui

This is generally not a problem; it often happens when agents retry sending samples. Unless data is missing, the error can be ignored.

fpetkovski avatar Jul 03 '23 12:07 fpetkovski

How can I add an option to ignore it?

kimpetertanui avatar Jul 12 '23 09:07 kimpetertanui

I was also facing the same issue when ingesting metrics using the OTel Collector prometheusremotewrite exporter. The culprit seems to be the target_info metric, which is enabled by default. Sometimes these target_info metrics have labels without any value (mostly net_host_port), which causes a 409 Conflict error on the Thanos side. The issue was fixed once I disabled the target_info metrics on the OTel side.

exporters:
  prometheusremotewrite:
    target_info:
      enabled: false
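
If disabling target_info is not an option, another approach (sketched here as an assumption, not something verified in this thread) is to drop the offending resource attribute with the resource processor from opentelemetry-collector-contrib, so it is never converted into the empty net_host_port label:

processors:
  resource:
    attributes:
      # net.host.port is the resource attribute behind the net_host_port label;
      # deleting it prevents the empty-value label from reaching the exporter.
      - key: net.host.port
        action: delete

The processor then has to be added to the metrics pipeline, e.g. processors: [resource, batch].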

Ygshvr avatar Mar 12 '24 11:03 Ygshvr

Same issue, even with target_info disabled. Reproducible with:

receivers:
  hostmetrics:
    root_path: /hostfs
    collection_interval: 10s
    scrapers:
      cpu:
      disk:
      filesystem:
      load:
      memory:
      network:
      paging:
      processes:
processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  resourcedetection:
    detectors: [env, system]
    timeout: 2s
    override: false

exporters:
  prometheusremotewrite:
    endpoint: "http://promthanos-receiver.foo.bar/api/v1/receive"
    external_labels:
      env: staging
    resource_to_telemetry_conversion:
      enabled: true
    target_info:
      enabled: false


service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      exporters: [prometheusremotewrite]

ErvalhouS avatar Apr 10 '24 16:04 ErvalhouS

OK, I can also confirm that disabling "target_info" does not solve the issue.

flenoir avatar May 06 '24 15:05 flenoir

It solved the problem on my side

kimpetertanui avatar May 06 '24 16:05 kimpetertanui

Hi @kimpetertanui, how did you manage to solve it? Just by disabling it? Can you post your collector config? I suspected it also had something to do with telemetry_conversion, but it did not.

flenoir avatar May 06 '24 16:05 flenoir

I added the below to my OTel Collector; the target_info section is what I was missing.

exporters:
  prometheusremotewrite:
    target_info:
      enabled: false

kimpetertanui avatar May 06 '24 16:05 kimpetertanui

I notice that the errors are limited to 3 or 4 each time. On the Thanos receiver side, there are 4 replicas. Could it be that a setting is missing on Thanos and the remote write tries to write the same metrics 4 times at once, which would explain this conflict?

flenoir avatar May 07 '24 08:05 flenoir

We have a similar set-up, and even though we've disabled target_info we still receive a lot of 409 Conflicts as well as out-of-order responses. I'm not quite sure what can be done here, as the receiver does not have many options to control its behaviour. Do you think deploying Kafka in front of the receiver would help with the out-of-order issues and the 409s?

Compared to Prometheus, Thanos receives 4k fewer metrics.

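Regarding the out-of-order responses: recent Thanos releases can accept a window of out-of-order samples on the receiver. A sketch, assuming your Thanos version ships the --tsdb.out-of-order.time-window flag (check thanos receive --help for your release; the 30m value is arbitrary):

args:
  - receive
  # Accept samples up to 30 minutes older than the newest sample of a series
  # instead of rejecting them as out-of-order.
  - --tsdb.out-of-order.time-window=30m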

evilr00t avatar May 10 '24 09:05 evilr00t

I think there may be a relation between the number of Thanos receiver replicas and the replication factor. How many receiver pod replicas do you have, and what is the replication factor?

flenoir avatar May 13 '24 09:05 flenoir

@flenoir currently we have 3 receiver replicas and a replication factor of 2 on both the receiver and the receiver distributor.

evilr00t avatar May 20 '24 10:05 evilr00t

The best way I know to debug this is by changing the log level of the receiver from info to debug. Then you should be able to see what is causing the conflict. For example, here is a sample log that I was seeing:

Labels with empty name in the label set" lset="labels:<name:"__name__" value:"target_info" > labels:<name:"http_scheme" value:"http" > labels:<name:"instance" value:"instance-xyz" > labels:<name:"job" value:"demo-job" > labels:<name:"net_host_name" value:"host-xyz" > labels:<name:"net_host_port" >

If you look closely, the last label, net_host_port, doesn't have a value like the other labels, which was causing this error.
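
For reference, the receiver's log level is raised with the standard --log.level flag; a minimal fragment of the container args (the rest of the manifest is assumed):

args:
  - receive
  # Debug logging prints the full label set behind each conflicting write,
  # which makes the offending series easy to spot.
  - --log.level=debug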

Ygshvr avatar May 20 '24 13:05 Ygshvr