HTTP status 409 Conflict Prometheus to Thanos Receiver Metrics
Hello All,
I have built an integration between Prometheus and Thanos Receiver, going through the OpenTelemetry Collector. Below is the relevant part of the ConfigMap for the OpenTelemetry Collector.
otel-collector-config.yaml: |-
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: "otel-collector-monitoring-1"
            scrape_interval: 5s
            honor_labels: true
            metrics_path: '/federate'
            params:
              'match[]':
                - '{job=~".+"}'
            static_configs:
              - targets:
                  - 'prometheus-k8s:9090'
  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
      metric_expiration: 180m
    prometheusremotewrite:
      endpoint: http://host.minikube.internal:32200/api/v1/receive
I see the metrics correctly landing in Thanos Receiver, but the OpenTelemetry Collector keeps continuously throwing the error below. What could be the reason for this, and what is the fix?
2022-09-27T09:42:23.200Z error exporterhelper/queued_retry.go:183 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 409 Conflict; err = %!w(): store locally for endpoint thanos-receive-default-0.thanos-receive-default.thanos.svc.cluster.local:10901: conflict\n", "dropped_items": 91}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
    go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:69
Hi, I am experiencing the same situation here, but only when traffic is high. Did this ever get resolved, or are there any clues about a configuration change that can fix it?
Hey folks, Have you tried exporting to Prometheus as well? Could you post your full collector configurations?
I vaguely remember there being some related issues with the exporter in the OTel repo. This could potentially be a bug where the data is not being exported correctly, or it could also be a misconfiguration (especially if you have multiple collector instances).
Related issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/11438
Sure. OTel Collector version: opentelemetry-collector-contrib:0.68.0, and the config YAML is below. The project I work on builds several OTel Collectors.
---
receivers:
  kafka:
    brokers:
      - kafka:9092
    topic: otlp_metrics-sink-id-222
    protocol_version: 2.0.0
extensions:
  pprof:
    endpoint: 0.0.0.0:1888
  basicauth/exporter:
    client_auth:
      username: <<user>>
      password: <<pass>>
exporters:
  prometheusremotewrite:
    endpoint: <<endpoint>>
    auth:
      authenticator: basicauth/exporter
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 50
service:
  extensions:
    - pprof
    - basicauth/exporter
  pipelines:
    metrics:
      receivers:
        - kafka
      exporters:
        - prometheusremotewrite
This is a new scenario for us; we didn't have any issues when running through Mimir, but we faced this with Thanos. The load is also a bit higher than in the Mimir scenario.
Thanks @matej-g
Hey @lpegoraro, Thanks for providing that extra info.
What is your receiver setup? Are you using replication? Do you see all your data as expected or are some data points missing? (Feel free to post your receiver config as well).
It could be that in such a case the 409 is benign. It can signal that your receiver instances already have this data, and so they refuse to write it again. This can happen in a scenario with high load and replication.
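For reference, the kind of setup where this happens looks roughly like the following. This is only an illustrative sketch (image tag, label, and DNS names are placeholders, not taken from anyone's deployment in this thread); the flags themselves are standard Thanos Receive options:

# Illustrative Kubernetes container args for a replicated Thanos Receive.
containers:
  - name: thanos-receive
    image: quay.io/thanos/thanos:v0.32.5   # example version, use your own
    args:
      - receive
      - --tsdb.path=/var/thanos/receive
      - --label=replica="$(NAME)"                 # NAME injected from the pod name
      - --receive.replication-factor=2            # each series is written to 2 peers
      - --receive.hashrings-file=/etc/thanos/hashrings.json
      - --receive.local-endpoint=$(NAME).thanos-receive.thanos.svc.cluster.local:10901
      - --remote-write.address=0.0.0.0:19291

With --receive.replication-factor greater than 1, or with agents retrying batches, a receiver can be asked to store samples it already has, and that surfaces to the remote-write client as a 409.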
I am also facing the same issue.
This is generally not a problem; it often happens when agents retry sending samples. Unless data is missing, the error can be ignored.
How can I add an option to ignore it?
I was also facing the same issue when ingesting metrics using the OTel Collector prometheusremotewrite exporter. The culprit seems to be the target_info metric, which is enabled by default. Sometimes these target_info metrics have labels without any value (mostly net_host_port), which causes the 409 Conflict error on the Thanos side. The issue got fixed when I disabled the target_info metric on the OTel side:
exporters:
  prometheusremotewrite:
    target_info:
      enabled: false
Same issue, even with target_info disabled. Reproducible with:
receivers:
  hostmetrics:
    root_path: /hostfs
    collection_interval: 10s
    scrapers:
      cpu:
      disk:
      filesystem:
      load:
      memory:
      network:
      paging:
      processes:
processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  resourcedetection:
    detectors: [env, system]
    timeout: 2s
    override: false
exporters:
  prometheusremotewrite:
    endpoint: "http://promthanos-receiver.foo.bar/api/v1/receive"
    external_labels:
      env: staging
    resource_to_telemetry_conversion:
      enabled: true
    target_info:
      enabled: false
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      exporters: [prometheusremotewrite]
OK, I can also confirm that disabling "target_info" does not solve the issue.
It solved the problem on my side
Hi @kimpetertanui, how did you manage to solve it? Just by disabling it? Can you post your collector config? I suspected it also had a relation with telemetry_conversion, but it did not.
I added the below to my OTel Collector; the target_info section is what I was missing:
exporters:
  prometheusremotewrite:
    target_info:
      enabled: false
I notice that the errors are limited to 3 or 4 each time. On the Thanos receiver side, there are 4 replicas. Could there be a missing setting on Thanos that would explain the remote write trying to write the same metrics 4 times at the same time, which would explain this conflict?
We have a similar setup, and even though we've disabled target_info we still receive a lot of 409 Conflicts as well as out-of-order responses. I'm not quite sure what can be done here, as the receiver does not have many options to control its behaviour. Do you think deploying Kafka in front of the receiver would help with the "out of order" issues and the 409s?
Compared to Prometheus, Thanos receives 4k fewer metrics.
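For what it's worth, the "Kafka in front of the receiver" idea would mean splitting the pipeline in two: edge collectors publish to Kafka, and a single consumer collector performs the remote write, so Receive sees one writer instead of many retrying agents. This is only a hypothetical sketch (broker, topic, and endpoint names are placeholders), not a tested fix:

# Hypothetical sketch only - edge collector publishes to Kafka instead of remote-writing.
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
exporters:
  kafka:
    brokers:
      - kafka:9092
    topic: otlp_metrics
    protocol_version: 2.0.0
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [kafka]
---
# Hypothetical sketch only - a single "writer" collector consumes from Kafka and remote-writes.
receivers:
  kafka:
    brokers:
      - kafka:9092
    topic: otlp_metrics
    protocol_version: 2.0.0
exporters:
  prometheusremotewrite:
    endpoint: http://thanos-receive.example.local/api/v1/receive
service:
  pipelines:
    metrics:
      receivers: [kafka]
      exporters: [prometheusremotewrite]

Whether this helps depends on whether the 409s and out-of-order errors come from many agents writing and retrying the same series, rather than from replication inside Thanos itself.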
I think there may be a relation with the Thanos receiver replicas and the replication factor. How many receiver pod replicas do you have? What is the replication factor?
@flenoir currently we have 3 receiver replicas and a replication factor of 2 on the receiver and the receiver distributor.
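As I understand that topology, the 3 receiver replicas are the members of the hashring that the distributor forwards to, and with a replication factor of 2 each series is written to 2 of those 3 endpoints. A hedged sketch of what the corresponding hashrings file (the one referenced by --receive.hashrings-file) might look like, with illustrative endpoint names:

[
  {
    "hashring": "default",
    "endpoints": [
      "thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901",
      "thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901"
    ]
  }
]

In that arrangement a retried or duplicated request can land on a node that already stored the samples, which is consistent with the benign-409 explanation earlier in the thread.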
The best way I know to debug this is by changing the log level of the receiver from info to debug. Then you should be able to see what is causing the conflict. For example, here is a sample log that I was seeing:
Labels with empty name in the label set" lset="labels:<name:"__name__" value:"target_info" > labels:<name:"http_scheme" value:"http" > labels:<name:"instance" value:"instance-xyz" > labels:<name:"job" value:"demo-job" > labels:<name:"net_host_name" value:"host-xyz" > labels:<name:"net_host_port" >
If you look closely, the last label net_host_port doesn't have a value like the other labels, which was causing this error.
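For anyone else trying this: the log-level change is the --log.level flag on the Receive component. A minimal sketch, assuming Receive runs as a container (everything other than the flag is illustrative):

# Illustrative container args - only --log.level is the relevant change.
args:
  - receive
  - --log.level=debug   # default is info; at debug the offending label sets are logged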