opentelemetry-collector-contrib
Prometheus receiver intermittently dropping kubelet metrics
Component(s)
receiver/prometheus
What happened?
Description
We're using the Prometheus receiver as a drop-in replacement for Prometheus, and it's mostly working as expected, but we're currently seeing gaps in the kubelet metrics the collectors produce. We have a pool of collectors and a Target Allocator deployment, all managed via an OpenTelemetryCollector resource by the OpenTelemetry Operator, with the collectors writing to an upstream Thanos receive stack. We provide the same static config to the collectors that we provide to Prometheus (which we still have running alongside), and what we see is intermittent gaps in the metrics coming from kubelet: just one or two dropped scrapes at a time, but they cause gaps in the series that skew all of our node-based metrics.
There's been some discussion on the CNCF slack so far: https://cloud-native.slack.com/archives/C01LSCJBXDZ/p1748621317097949
So far we have investigated and found that:
- The collectors aren't under any resource strain when dropping series
- The drops are only happening in kubernetes-nodes and kubernetes-nodes-cadvisor jobs
- The drops are not consistent across all nodes a particular collector is scraping (i.e. all nodes from a particular collector experience it but not at the same time)
- The drops don't correlate to collectors restarting or scaling or nodes starting/stopping
- There's nothing obvious in the collector logs, even with detailed logging (some timeouts to the apiserver but those seem to relate to nodes starting/stopping rather than these gaps)
- Prometheus still running in the same cluster with the same config does not have the same problem
I'm not entirely sure the issue is with the Prometheus receiver; given it uses the same scrape logic as Prometheus itself, I see no reason why it would behave differently. It could well be a difference in the way the collector/exporter manage the series, particularly the kubelet ones with honor_timestamps: true, but I'm struggling to narrow down what exactly is happening.
Steps to Reproduce
OTel collector with the static config below, with the Prometheus remote write exporter pointed at a Thanos receive stack.
Expected Result
This is from Prometheus in the same cluster monitoring up{job="kubernetes-nodes"} for a given instance:
Actual Result
This is the same instance via the collector:
Collector version
0.122.1
Environment information
Environment
EKS: v1.31.9-eks-5d4a308
OpenTelemetry Collector configuration
exporters:
  debug: {}
  prometheusremotewrite:
    endpoint: ${THANOS_RECEIVER}
    remote_write_queue:
      enabled: true
      num_consumers: 5
    target_info:
      enabled: false
    timeout: 10s
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
receivers:
  prometheus:
    api_server:
      enabled: true
      server_config:
        endpoint: localhost:9090
    config:
      global:
        scrape_interval: 15s
      scrape_configs:
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          honor_timestamps: true
          job_name: kubernetes-nodes
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          honor_timestamps: true
          job_name: kubernetes-nodes-cadvisor
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - action: drop
              regex: container_cpu_(load_average_10s|system_seconds_total|user_seconds_total)
              source_labels:
                - __name__
            - action: drop
              regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
              source_labels:
                - __name__
            - action: drop
              regex: container_memory_(mapped_file|swap)
              source_labels:
                - __name__
            - action: drop
              regex: container_(tasks_state|threads_max)
              source_labels:
                - __name__
            - action: drop
              regex: container_spec_(cpu.*|memory_swap_limit_bytes|memory_reservation_limit_bytes)
              source_labels:
                - __name__
            - action: drop
              regex: .+;
              source_labels:
                - id
                - pod
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
        - job_name: kubernetes-service-endpoints
          kubernetes_sd_configs:
            - role: endpoints
          relabel_configs:
            ...
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            ...
    target_allocator:
      collector_id: ${POD_NAME}
      endpoint: http://opentelemetry-metrics-targetallocator
      interval: 30s
service:
  pipelines:
    metrics:
      exporters:
        - prometheusremotewrite
        - debug
      processors:
        - memory_limiter
      receivers:
        - prometheus
Log output
Additional context
No response
Pinging code owners:
- receiver/prometheus: @Aneurysm9 @dashpole @ArthurSens @krajorama
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@alclark704 Hello!
Thanks for the detailed information. I haven't worked with Prometheus before, but does your collector instance throw any errors? You could probably enable debug logging as well.
Hi @VihasMakwana
Unfortunately, no errors or info that seems relevant around the time of the gaps via the debug exporter. I've also gone up to detailed logging and so far haven't found anything useful in there, though with multiple lines for every scrape it's quite hard to sift through.
Are you able to tell which collector is scraping which node at any point in time (e.g. by adding a constant label from each collector)? It would be helpful to know if that is happening when the target is being moved between collectors, or when it is assigned to a single one.
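For example, something along these lines could do it (just a sketch against the config above, not something you necessarily have already; the external_labels option on the prometheusremotewrite exporter stamps a constant label onto everything a collector writes, and the collector_pod label name is only illustrative):

exporters:
  prometheusremotewrite:
    endpoint: ${THANOS_RECEIVER}
    # Illustrative label; any value unique per collector pod works.
    external_labels:
      collector_pod: ${POD_NAME}

That would let you group the written series by collector_pod and see whether the gaps line up with a target changing hands.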
/api/v1/nodes/$$1/proxy/metrics
Note that this means you are proxying metrics through your kubernetes APIserver. Any APIServer timeout would mean you don't have metrics. Generally, I would recommend using a daemonset to directly scrape the kubelet, rather than sharding using the targetallocator and scraping by using the APIServer as a proxy.
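For reference, a rough sketch of what a direct kubelet scrape can look like when each collector only scrapes its own node (this is not your current config; it assumes the collector runs as a daemonset with the node name injected as a K8S_NODE_NAME environment variable via the downward API):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubelet-direct
          scheme: https
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            # With role: node the default __address__ is already the kubelet's
            # address, so no API server proxy rewrite is needed. Keep only the
            # node this daemonset pod runs on; K8S_NODE_NAME is assumed to be
            # injected via the downward API and expanded by the collector.
            - action: keep
              source_labels: [__meta_kubernetes_node_name]
              regex: ${K8S_NODE_NAME}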
Can you check the up metric for each target, and look at the scrape_* metrics the Prometheus receiver adds automatically? Those can help show whether this is because of failed scrapes or something else.
Hi @dashpole,
Thanks for the response! I can grab some dashboards if it helps, but this is happening when a target is consistently on a single collector. We did have the same thought, which was the rationale behind checking whether the collectors were scaling up/down when the gaps occurred, but that wasn't the case. I think with consistent-hashing we shouldn't see targets pass between collectors unless the number of collectors changes (?). We do occasionally get duplicated targets where one is passed between two collectors, but that's much less frequent and gets de-duplicated by Thanos.
Absolutely agree it's not the way we want to do the node metrics long term. With the migration to OTel we're trying to minimise changing too many things at once, so we were hoping to keep the config itself the same and just verify we can recreate the current stack with OTel for now, though it might be worth moving that change up if this persists. We have also considered API server timeouts: I checked the scrape duration on the metrics in question and found it was perfectly normal either side of the gap, but the scrape duration metric itself is also missing in those gaps. If it were the API server timing out, I'd find it odd that Prometheus and the collector, which under the hood should be hitting the same endpoint with the same scrape logic, behave differently: one consistently sees timeouts and the other doesn't see any. I did start digging through apiserver logs to see if there was anything in there but didn't get too far, so maybe it's worth going back to that.
I think I've covered a bit of the last point, but yes, the screenshots attached above are the up{} metrics for a given node, and the scrape_* metrics are all also missing for the same node in the same gaps, like this:
Very much appreciate the suggestions and if there's anything else you can think of to check I'm happy to take a look.
If the up metric is missing during the window then that almost certainly indicates that the problem is happening after the Prometheus receiver. It emits up=0 when a scrape fails, so the metric should always be present even when scrapes fail.
Ok, that is good to know, thank you very much for your help. I guess the next port of call is the remote write exporter, so I've just tried updating the component in the original body to that, though I'm not sure whether that will actually relabel the issue.
Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil @dashpole @ArthurSens @ywwg. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
This is just a wild guess, but here goes nothing: Before I started my adventure into the world of Otelcol a few days ago, I tried to use the agent-mode of prometheus to scrape my cluster metrics and send them to Mimir. I quickly encountered this issue while trying to scrape cAdvisor metrics:
- https://github.com/prometheus/prometheus/issues/14973
This issue is possibly completely unrelated, but after seeing the words "kubelet metrics" and "gaps" in this issue description, I felt reminded of the linked prometheus issue.
Maybe there is something special about kubelet/cAdvisor metrics that make them hard to scrape by stateless agents like otelcol and prometheus-agent?
@alclark704 Maybe you want to try running a prometheus in "agent" mode to check if the same issue appears?
Hi @ChristianCiach ,
It's a good suggestion, but I've just tried out the agent as well and it doesn't appear to suffer from the same issue. Below is sum(up{job="kubernetes-nodes"}) by (instance, prometheus_replica) via the collector and the Prometheus agent respectively; the same instance has gaps on the collector but not the agent.
I do feel like it's likely to be something in that area though. I double-checked our Thanos receive stacks and the out-of-order window is 30m, which should be more than enough, given the 5 minute machine info metric interval mentioned on that issue.
/label waiting-for-code-owners -waiting-for-author
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
- exporter/prometheusremotewriteexporter: @Aneurysm9 @rapphil @dashpole @ArthurSens @ywwg
- receiver/prometheusreceiver: @Aneurysm9 @dashpole @ArthurSens @krajorama
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Just confirming we are still seeing this issue and struggling to track down where it's coming from. Any suggestions for solutions to try, or additional debugging to identify where the series are getting dropped, would be much appreciated. One bit of info I realised I forgot to add to the ticket is below. I was again trying to identify where the metrics are getting lost; the chart below is with detailed logging on the debug exporter, filtering specifically for cadvisor_version_info. The consistent green pings are those logs for a node we were getting consistent metrics for (the red line is the metric itself). The patchy green pings are from a node where the gaps occur (the blue line shows the gap). There doesn't seem to be a consistent relationship between long gaps in the recorded log lines and the gaps in the series, but it does look like something at the node or collector level means we're not consistently recording metric points for a given node, and that's when we see the gaps. This is also from the debug exporter, so it might not represent what's in the remote write exporter, but I was just trying to narrow down where the issue is happening, and it seems like it's potentially happening between the scrape cycle and actually exporting.
Just sharing from my side: I have the same issue, and I've already tested the OTLP exporter instead of the remote write exporter and the same issue happens, which supports what @alclark704 just said about the issue seemingly happening between scrape and export.
Have you set up the collector self-observability metrics (see https://opentelemetry.io/docs/collector/internal-telemetry/)? That should have metrics from all of the components that can tell you if any of them are dropping points.
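If it helps, a minimal sketch of turning that on (assuming a reasonably recent collector; the exact service::telemetry::metrics format has changed between versions, so check the docs for the version you run):

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        # Expose the collector's own metrics on :8888 so they can be scraped
        # alongside everything else.
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888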
I've had the self-observability metrics enabled and can't see any points dropped/failed around when the gaps are:
That's showing the gaps in the cadvisor metrics for a given node alongside a sum of all the potential failed/dropped metrics. I don't even seem to be getting series for the processor dropped/refused metrics (they do appear when, e.g., the memory limiter kicks in), but there doesn't seem to be anything getting actively failed/dropped/refused.
We had a discussion on Friday and based on what we are seeing I think the suspicion is that the kubelet series are occasionally being marked as stale for some reason. This is based on the fact that these metrics are being remote written to and queried from Thanos, which has a 5 minute lookback window. When we see gaps in queries, the equivalent gaps in the debug exporter are less than 5 minutes, so in that case we would expect Thanos to take the last sent value until either a new value is sent or 5 minutes passes. This implies it's not just that metrics aren't being sent, but potentially something is being actively sent to mark the series as stale.
I'm still trying to track this with the debug exporter to see whether the last value sent before the gaps is a NaN or equivalent, but if anyone has more context and could help us confirm whether that is the case, it would be much appreciated.
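Roughly, the kind of setup I mean is something like this (a sketch rather than our exact config; the filter/only_cadvisor_version name is made up): a separate pipeline with the debug exporter at detailed verbosity and a filter processor keeping just the one series, so any staleness markers sent for it are easier to spot in the logs.

processors:
  # Drop everything except the one metric being tracked, to keep the detailed
  # debug output readable. Processor name and metric name are illustrative.
  filter/only_cadvisor_version:
    metrics:
      metric:
        - 'name != "cadvisor_version_info"'
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics/debug:
      receivers: [prometheus]
      processors: [filter/only_cadvisor_version]
      exporters: [debug]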
I did also see some otelcol_scraper_* metrics in the internal telemetry that might be relevant, if only to rule out an error on scraping, but they don't seem to appear in my collector metrics (running 0.133.0 now, with the level set to normal, which should include them). I'd expect at least the otelcol_scraper_scraped_metric_points series to appear, but maybe this isn't implemented by the Prometheus receiver.
Yeah, the Prometheus receiver doesn't use the common scraper library. It uses the Prometheus server's scraping logic.
@alclark704 my findings and trials so far:
I am seeing target metrics go missing rather than just returning up == 0. It looks like the entire target disappears for a while. That includes up, scrape_.*, and other series. I suspect something is stalling the scrape path or the target is being dropped.
Setup
I am running a simple layout:
- Target Allocator using Prometheus Operator CRs to discover targets and distribute them across collector instances
- OpenTelemetry Collector as a StatefulSet with:
  - Prometheus Receiver integrated with the Target Allocator
  - Prometheus Remote Write Receiver
How I scrape kubelet
I use the Prometheus Operator strategy.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-operator-kubelet
  namespace: kube-system
spec:
  clusterIP: None
  clusterIPs:
    - None
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
    - IPv6
  ipFamilyPolicy: RequireDualStack
  ports:
    - name: https-metrics
      port: 10250
      protocol: TCP
      targetPort: 10250
    - name: http-metrics
      port: 10255
      protocol: TCP
      targetPort: 10255
    - name: cadvisor
      port: 4194
      protocol: TCP
      targetPort: 4194
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-operator-kubelet
spec:
  attachMetadata:
    node: false
  endpoints:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      honorTimestamps: true
      interval: 30s
      metricRelabelings:
        - action: replace
          sourceLabels:
            - __metrics_path__
          targetLabel: metrics_path
      port: https-metrics
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      honorTimestamps: true
      interval: 30s
      metricRelabelings:
        - action: labeldrop
          regex: (id|name)
      path: /metrics/cadvisor
      port: https-metrics
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      honorTimestamps: true
      interval: 30s
      metricRelabelings:
        - action: drop
          regex: prober_probe_(duration_seconds.*|total)
          sourceLabels:
            - __name__
      path: /metrics/probes
      port: https-metrics
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet
      k8s-app: kubelet
This is a different setup from @alclark704's, which uses the API server proxy.
What I tried
- Enabled debug and development logs on the Target Allocator and the Collector. I expected clear errors such as scrape failures or timeouts. I did not see that. If the scrape had failed I would still expect up == 0, not the entire series missing.
- Disabled honor timestamps. I read a few threads about cAdvisor adding timestamps that could mark series as stale. Disabling did not help. Also, even if timestamps were the cause, I would expect a few series to drop, not every series including up and scrape_*.
- Switched to OTLP exporter. I moved from Prometheus Remote Write to OTLP just to rule out exporter issues. The gap still happens.
- Moved the kubelet scrape from the StatefulSet to a DaemonSet. We use OpenTelemetry for logs and traces too, so I moved the kubelet scrape to the agent DaemonSet for a test. That works: no gaps. Good on one side, but it raises the question below. (A sketch of that DaemonSet layout follows this list.)
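For reference, one way a DaemonSet layout like that can be wired up (a sketch with illustrative names; it assumes the operator's per-node allocation strategy so each agent only receives targets on its own node, and the collector config itself is omitted):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: metrics-agent   # illustrative name
spec:
  mode: daemonset
  # spec.config (prometheus receiver + target_allocator section) omitted for brevity
  targetAllocator:
    enabled: true
    # per-node hands each agent only the targets scheduled on its own node
    allocationStrategy: per-node
    prometheusCR:
      enabled: true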
Open question
Why does the scrape work when collectors run as DaemonSet agents but show missing metrics when collectors run as a StatefulSet, even though I am not using the API server proxy in either case?
Update from my side at least: it appears we were hitting this separate issue in the targetallocator: https://github.com/open-telemetry/opentelemetry-operator/issues/4072
That explains why we couldn't track down the issue in the collector itself! Upgrading the collector and the TA has resolved the gaps for me.