opentelemetry-collector-contrib
Prometheus receiver intermittently dropping kubelet metrics
Component(s)
receiver/prometheus
What happened?
Description
We're using the Prometheus receiver as a drop-in replacement for Prometheus, and it's mostly working as expected, but we're currently seeing gaps in the kubelet metrics the collectors produce. We have a pool of collectors and a Target Allocator deployment, all managed via an OpenTelemetryCollector resource by the OpenTelemetry Operator, with the collectors writing to an upstream Thanos receive stack. We provide the same static config to the collectors that we provide to Prometheus (which we still have running alongside), and what we see is intermittent gaps in the metrics coming from kubelet: just one or two dropped scrapes at a time, but they cause gaps in the series that skew all of our node-based metrics.
There's been some discussion on the CNCF slack so far: https://cloud-native.slack.com/archives/C01LSCJBXDZ/p1748621317097949
So far we have investigated and found that:
- The collectors aren't under any resource strain when dropping series
- The drops are only happening in kubernetes-nodes and kubernetes-nodes-cadvisor jobs
- The drops are not consistent across all nodes a particular collector is scraping (i.e. all nodes from a particular collector experience it but not at the same time)
- The drops don't correlate to collectors restarting or scaling or nodes starting/stopping
- There's nothing obvious in the collector logs, even with detailed logging (some timeouts to the apiserver but those seem to relate to nodes starting/stopping rather than these gaps)
- Prometheus still running in the same cluster with the same config does not have the same problem
I'm not entirely sure the issue is with the Prometheus receiver; given it uses the same scrape logic as Prometheus itself, I see no reason why it would behave differently. It could well be a difference in the way the collector/exporter manage the series, particularly the kubelet ones with honor_timestamps: true, but I'm struggling to narrow down what exactly is happening.
Steps to Reproduce
OTel collector with the static config below, with the Prometheus remote write exporter pointed at a Thanos receive stack.
Expected Result
This is from Prometheus in the same cluster monitoring up{job="kubernetes-nodes"} for a given instance:
Actual Result
This is the same instance via the collector:
Collector version
0.122.1
Environment information
Environment
EKS: v1.31.9-eks-5d4a308
OpenTelemetry Collector configuration
exporters:
  debug: {}
  prometheusremotewrite:
    endpoint: ${THANOS_RECEIVER}
    remote_write_queue:
      enabled: true
      num_consumers: 5
    target_info:
      enabled: false
    timeout: 10s
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
receivers:
  prometheus:
    api_server:
      enabled: true
      server_config:
        endpoint: localhost:9090
    config:
      global:
        scrape_interval: 15s
      scrape_configs:
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          honor_timestamps: true
          job_name: kubernetes-nodes
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          honor_timestamps: true
          job_name: kubernetes-nodes-cadvisor
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - action: drop
              regex: container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)
              source_labels:
                - __name__
            - action: drop
              regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
              source_labels:
                - __name__
            - action: drop
              regex: container_memory_(mapped_file|swap)
              source_labels:
                - __name__
            - action: drop
              regex: container_(tasks_state|threads_max)
              source_labels:
                - __name__
            - action: drop
              regex: container_spec_(cpu.*|memory_swap_limit_bytes|memory_reservation_limit_bytes)
              source_labels:
                - __name__
            - action: drop
              regex: .+;
              source_labels:
                - id
                - pod
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
        - job_name: kubernetes-service-endpoints
          kubernetes_sd_configs:
            - role: endpoints
          relabel_configs:
            ...
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            ...
    target_allocator:
      collector_id: ${POD_NAME}
      endpoint: http://opentelemetry-metrics-targetallocator
      interval: 30s
service:
  pipelines:
    metrics:
      exporters:
        - prometheusremotewrite
        - debug
      processors:
        - memory_limiter
      receivers:
        - prometheus
Log output
Additional context
No response
Pinging code owners:
- receiver/prometheus: @Aneurysm9 @dashpole @ArthurSens @krajorama
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@alclark704 Hello!
Thanks for the detailed information. I haven't worked with Prometheus before, but does your collector instance throw any errors? You could probably enable debug logging as well.
Hi @VihasMakwana
Unfortunately, no errors or info that seems relevant around the time of the gaps via the debug exporter. I've also gone up to detailed logging and so far haven't found anything useful in there, though with multiple lines for every scrape it's quite hard to sift through.
Are you able to tell which collector is scraping which node at any point in time (e.g. by adding a constant label from each collector)? It would be helpful to know if that is happening when the target is being moved between collectors, or when it is assigned to a single one.
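For example, something along these lines could do it (just a sketch against the config above, not something you necessarily have already; the external_labels option on the prometheusremotewrite exporter stamps a constant label onto everything a collector writes, and the collector_pod label name is only illustrative):

exporters:
  prometheusremotewrite:
    endpoint: ${THANOS_RECEIVER}
    # Illustrative label; any value unique per collector pod works.
    external_labels:
      collector_pod: ${POD_NAME}

That would let you group the written series by collector_pod and see whether the gaps line up with a target changing hands.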
/api/v1/nodes/$$1/proxy/metrics
Note that this means you are proxying metrics through your kubernetes APIserver. Any APIServer timeout would mean you don't have metrics. Generally, I would recommend using a daemonset to directly scrape the kubelet, rather than sharding using the targetallocator and scraping by using the APIServer as a proxy.
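For reference, a rough sketch of what a direct kubelet scrape can look like when each collector only scrapes its own node (this is not your current config; it assumes the collector runs as a daemonset with the node name injected as a K8S_NODE_NAME environment variable via the downward API):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubelet-direct
          scheme: https
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            # With role: node the default __address__ is already the kubelet's
            # address, so no API server proxy rewrite is needed. Keep only the
            # node this daemonset pod runs on; K8S_NODE_NAME is assumed to be
            # injected via the downward API and expanded by the collector.
            - action: keep
              source_labels: [__meta_kubernetes_node_name]
              regex: ${K8S_NODE_NAME}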
Can you check the up metric for each target, and look at the scrape_* metrics the Prometheus receiver adds automatically? Those can help show whether this is because of failed scrapes or something else.
Hi @dashpole,
Thanks for the response! I can grab some dashboards if it helps, but this is happening when a target is consistently on a single collector. We did have the same thought, which was the rationale behind checking whether the collectors were scaling up/down when the gaps occurred, but that wasn't the case. I think with consistent-hashing we shouldn't see targets pass between collectors unless the number of collectors changes (?). We do occasionally get duplicated targets where one is passed between two collectors, but that's much less frequent and gets de-duplicated by Thanos.
Absolutely agree it's not the way we want to do the node metrics long term. With the migration to OTel we're trying to minimise changing too many things at once, so we were hoping to keep the config itself the same and just verify we can recreate the current stack with OTel for now, though it might be worth moving that change up if this persists. We have also considered API server timeouts: I checked the scrape duration on the metrics in question and found it was perfectly normal either side of the gap, but the scrape duration metric itself is also missing in those gaps. If it were the API server timing out, I'd find it odd that Prometheus and the collector, which under the hood should be hitting the same endpoint with the same scrape logic, behave differently: one consistently sees timeouts and the other doesn't see any. I did start digging through apiserver logs to see if there was anything in there but didn't get too far, so maybe it's worth going back to that.
I think I've covered a bit of the last point, but yes, the screenshots attached above are the up{} metrics for a given node, and the scrape_* metrics are all also missing for the same node in the same gaps, like this:
Very much appreciate the suggestions and if there's anything else you can think of to check I'm happy to take a look.
If the up metric is missing during the window then that almost certainly indicates that the problem is happening after the Prometheus receiver. It emits up=0 when a scrape fails, so the metric should always be present even when scrapes fail.
Ok, that is good to know, thank you very much for your help. I guess the next port of call is the remote write exporter, so I've just tried updating the component in the original body to that, though I'm not sure whether that will actually relabel the issue.
Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil @dashpole @ArthurSens @ywwg. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
This is just a wild guess, but here goes nothing: Before I started my adventure into the world of Otelcol a few days ago, I tried to use the agent-mode of prometheus to scrape my cluster metrics and send them to Mimir. I quickly encountered this issue while trying to scrape cAdvisor metrics:
- https://github.com/prometheus/prometheus/issues/14973
This issue is possibly completely unrelated, but after seeing the words "kubelet metrics" and "gaps" in this issue description, I felt reminded of the linked prometheus issue.
Maybe there is something special about kubelet/cAdvisor metrics that make them hard to scrape by stateless agents like otelcol and prometheus-agent?
@alclark704 Maybe you want to try running a prometheus in "agent" mode to check if the same issue appears?
Hi @ChristianCiach ,
It's a good suggestion, but I've just tried out the agent as well and it doesn't appear to suffer from the same issue. Below is sum(up{job="kubernetes-nodes"}) by (instance, prometheus_replica) via the collector and the Prometheus agent respectively; the same instance has gaps on the collector but not the agent.
I do feel like it's likely to be something in that area though. I double-checked our Thanos receive stacks and the out-of-order window is 30m, which should be more than enough, given the 5 minute machine info metric interval mentioned on that issue.
/label waiting-for-code-owners -waiting-for-author
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
- exporter/prometheusremotewriteexporter: @Aneurysm9 @rapphil @dashpole @ArthurSens @ywwg
- receiver/prometheusreceiver: @Aneurysm9 @dashpole @ArthurSens @krajorama
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Just confirming we are still seeing this issue and struggling to track down where it's coming from. Any suggestions for solutions to try, or additional debugging to identify where the series are getting dropped, would be much appreciated. One bit of info I realised I forgot to add to the ticket is below. I was again trying to identify where the metrics are getting lost; the chart below is with detailed logging on the debug exporter, filtering specifically for cadvisor_version_info. The consistent green pings are those logs for a node we were getting consistent metrics for (the red line is the metric itself). The patchy green pings are from a node where the gaps occur (the blue line shows the gap). There doesn't seem to be a consistent relationship between long gaps in the recorded log lines and the gaps in the series, but it does look like something at the node or collector level means we're not consistently recording metric points for a given node, and that's when we see the gaps. This is also from the debug exporter, so it might not represent what's in the remote write exporter, but I was just trying to narrow down where the issue is happening, and it seems like it's potentially happening between the scrape cycle and actually exporting.
Just sharing from my side: I have the same issue, and I've already tested the OTLP exporter instead of the remote write exporter and the same issue happens, which supports what @alclark704 just said about the issue seemingly happening between scrape and export.
Have you set up the collector self-observability metrics (see https://opentelemetry.io/docs/collector/internal-telemetry/)? That should have metrics from all of the components that can tell you if any of them are dropping points.
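If it helps, a minimal sketch of turning that on (assuming a reasonably recent collector; the exact service::telemetry::metrics format has changed between versions, so check the docs for the version you run):

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        # Expose the collector's own metrics on :8888 so they can be scraped
        # alongside everything else.
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888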
I've had the self-observability metrics enabled and can't see any points dropped/failed around when the gaps are:
That's showing the gaps in the cadvisor metrics for a given node alongside a sum of all the potential failed/dropped metrics. I don't even seem to be getting series for the processor dropped/refused metrics (they do appear when, e.g., the memory limiter kicks in), but there doesn't seem to be anything getting actively failed/dropped/refused.
We had a discussion on Friday and based on what we are seeing I think the suspicion is that the kubelet series are occasionally being marked as stale for some reason. This is based on the fact that these metrics are being remote written to and queried from Thanos, which has a 5 minute lookback window. When we see gaps in queries, the equivalent gaps in the debug exporter are less than 5 minutes, so in that case we would expect Thanos to take the last sent value until either a new value is sent or 5 minutes passes. This implies it's not just that metrics aren't being sent, but potentially something is being actively sent to mark the series as stale.
I'm still trying to track this with the debug exporter to see whether the last value sent before the gaps is a NaN or equivalent, but if anyone has more context and could help us confirm whether that is the case, it would be much appreciated.
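Roughly, the kind of setup I mean is something like this (a sketch rather than our exact config; the filter/only_cadvisor_version name is made up): a separate pipeline with the debug exporter at detailed verbosity and a filter processor keeping just the one series, so any staleness markers sent for it are easier to spot in the logs.

processors:
  # Drop everything except the one metric being tracked, to keep the detailed
  # debug output readable. Processor name and metric name are illustrative.
  filter/only_cadvisor_version:
    metrics:
      metric:
        - 'name != "cadvisor_version_info"'
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics/debug:
      receivers: [prometheus]
      processors: [filter/only_cadvisor_version]
      exporters: [debug]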
I did also see some otelcol_scraper_* metrics in the internal telemetry that might be relevant, if only to rule out an error on scraping, but they don't seem to appear in my collector metrics (running 0.133.0 now, with the level set to normal, which should include them). I'd expect at least the otelcol_scraper_scraped_metric_points series to appear, but maybe this isn't implemented by the Prometheus receiver.
Yeah, the Prometheus receiver doesn't use the common scraper library. It uses the Prometheus server's scraping logic.
@alclark704 my findings and trials so far:
I am seeing target metrics go missing rather than just returning up == 0. It looks like the entire target disappears for a while. That includes up, scrape_.*, and other series. I suspect something is stalling the scrape path or the target is being dropped.
Setup
I am running a simple layout:
- Target Allocator using Prometheus Operator CRs to discover targets and distribute them across collector instances
- OpenTelemetry Collector as a StatefulSet with:
  - Prometheus Receiver integrated with the Target Allocator
  - Prometheus Remote Write Receiver
How I scrape kubelet
I use the Prometheus Operator strategy.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-operator-kubelet
  namespace: kube-system
spec:
  clusterIP: None
  clusterIPs:
    - None
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
    - IPv6
  ipFamilyPolicy: RequireDualStack
  ports:
    - name: https-metrics
      port: 10250
      protocol: TCP
      targetPort: 10250
    - name: http-metrics
      port: 10255
      protocol: TCP
      targetPort: 10255
    - name: cadvisor
      port: 4194
      protocol: TCP
      targetPort: 4194
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-operator-kubelet
spec:
  attachMetadata:
    node: false
  endpoints:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      honorTimestamps: true
      interval: 30s
      metricRelabelings:
        - action: replace
          sourceLabels:
            - __metrics_path__
          targetLabel: metrics_path
      port: https-metrics
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      honorTimestamps: true
      interval: 30s
      metricRelabelings:
        - action: labeldrop
          regex: (id|name)
      path: /metrics/cadvisor
      port: https-metrics
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      honorTimestamps: true
      interval: 30s
      metricRelabelings:
        - action: drop
          regex: prober_probe_(duration_seconds.*|total)
          sourceLabels:
            - __name__
      path: /metrics/probes
      port: https-metrics
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet
      k8s-app: kubelet
This is a different setup from @alclark704's, which uses the API server proxy.
What I tried
- Enabled debug and development logs on the Target Allocator and the Collector. I expected clear errors such as scrape failures or timeouts. I did not see that. If the scrape had failed I would still expect up == 0, not the entire series missing.
- Disabled honor timestamps. I read a few threads about cAdvisor adding timestamps that could mark series as stale. Disabling did not help. Also, even if timestamps were the cause, I would expect a few series to drop, not every series including up and scrape_*.
- Switched to OTLP exporter. I moved from Prometheus Remote Write to OTLP just to rule out exporter issues. The gap still happens.
- Moved the kubelet scrape from the StatefulSet to a DaemonSet. We use OpenTelemetry for logs and traces too, so I moved the kubelet scrape to the agent DaemonSet for a test. That works: no gaps. Good on one side, but it raises the question below. (A sketch of that DaemonSet layout follows this list.)
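For reference, one way a DaemonSet layout like that can be wired up (a sketch with illustrative names; it assumes the operator's per-node allocation strategy so each agent only receives targets on its own node, and the collector config itself is omitted):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: metrics-agent   # illustrative name
spec:
  mode: daemonset
  # spec.config (prometheus receiver + target_allocator section) omitted for brevity
  targetAllocator:
    enabled: true
    # per-node hands each agent only the targets scheduled on its own node
    allocationStrategy: per-node
    prometheusCR:
      enabled: true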
Open question
Why does the scrape work when collectors run as DaemonSet agents but show missing metrics when collectors run as a StatefulSet, even though I am not using the API server proxy in either case?
Update from my side at least: it appears we were hitting this separate issue in the targetallocator: https://github.com/open-telemetry/opentelemetry-operator/issues/4072
That explains why we couldn't track down the issue in the collector itself! Upgrading the collector and the TA has resolved the gaps for me.