stackdriver_exporter
Error on ingesting out-of-order samples
After upgrading from 0.8.0 to 0.11.0, we started seeing a flood of warnings in our Prometheus server logs.
The warning first appears with 0.9.1 and is still present in every version up to 0.11.0.
Prometheus log entry
{"caller":"scrape.go:1467","component":"scrape manager","level":"warn","msg":"Error on ingesting out-of-order samples","num_dropped":197,"scrape_pool":"kubernetes-service-endpoints","target":"http://10.34.95.168:9255/metrics","ts":"2021-06-10T16:59:22.545Z"}
@SuperQ any idea what the issue might be? I am running the exporter with the following environment variables:
STACKDRIVER_EXPORTER_MONITORING_METRICS_INTERVAL: 5m
STACKDRIVER_EXPORTER_MONITORING_METRICS_OFFSET: 0s
STACKDRIVER_EXPORTER_WEB_LISTEN_ADDRESS: :9255
STACKDRIVER_EXPORTER_WEB_TELEMETRY_PATH: /metrics
STACKDRIVER_EXPORTER_MAX_RETRIES: 0
STACKDRIVER_EXPORTER_HTTP_TIMEOUT: 30s
STACKDRIVER_EXPORTER_MAX_BACKOFF_DURATION: 5s
STACKDRIVER_EXPORTER_BACKODFF_JITTER_BASE: 1s
STACKDRIVER_EXPORTER_RETRY_STATUSES: 503
STACKDRIVER_EXPORTER_DROP_DELEGATED_PROJECTS: false
I managed to get rid of those errors by introducing an offset for the metrics pulled from Cloud Monitoring (Stackdriver). The issue seems to be caused by the ingestion delay: https://cloud.google.com/monitoring/api/metrics#metadata
@anas-aso Did you set the offset to 240 seconds?
@weyert A 1m offset was enough to get rid of the errors. I also didn't want to get close to the Prometheus query.lookback-delta default value of 5m; otherwise I would have to touch many alert queries.
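To illustrate the trade-off between the offset and query.lookback-delta, here is a rough sketch. It assumes the exporter stamps the newest sample at roughly (scrape time - offset), which is a simplification and not taken from the exporter's code. The 1m scrape interval in the usage note and the function names are hypothetical and only there for illustration.

```python
from datetime import timedelta

# Prometheus default for query.lookback-delta, as mentioned in this thread.
LOOKBACK_DELTA = timedelta(minutes=5)


def worst_case_sample_age(offset: timedelta, scrape_interval: timedelta) -> timedelta:
    """Age of the newest sample just before the next scrape.

    Assumption (not from the exporter source): the newest sample is
    timestamped at about (scrape time - offset).
    """
    return offset + scrape_interval


def visible_to_instant_query(
    offset: timedelta,
    scrape_interval: timedelta,
    lookback: timedelta = LOOKBACK_DELTA,
) -> bool:
    """True if the newest sample stays inside the lookback window
    for the whole time between two scrapes."""
    return worst_case_sample_age(offset, scrape_interval) < lookback
```

With a hypothetical 1m scrape interval, `visible_to_instant_query(timedelta(minutes=1), timedelta(minutes=1))` holds, while an offset of 240 seconds lands exactly on the 5m boundary and fails this conservative check, which is the kind of situation that would require touching alert queries.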
Thanks @anas-aso. I will give that a shot
Do you use a version of the exporter with your PR merged?
@weyert I only tried it in a staging environment before opening the PR. The problem with that change is that, within the same scrape, you will get metrics that are offset by different values (since the offset is derived dynamically from the GCP Monitoring API metadata). As a result, writing alerts requires an extra step of checking the offset applied to a metric before using it. That's why I haven't used my patch in prod yet: I want to get the maintainers' feedback first.
Thank you, that's a good point. I didn't think of that. Hopefully the maintainers will give you feedback :)
Setting STACKDRIVER_EXPORTER_MONITORING_METRICS_OFFSET to 30s fixed the issue for us.
I stopped using Stackdriver Exporter in favor of https://github.com/GoogleCloudPlatform/prometheus-engine/tree/main/cmd/frontend.