aws-otel-collector
`awsemf` exporter does not export metrics with no labels
Describe the bug
Cluster Autoscaler has metrics on the /metrics endpoint such as
# TYPE cluster_autoscaler_cluster_safe_to_autoscale gauge
cluster_autoscaler_cluster_safe_to_autoscale 1
This can be seen when you hit the /metrics endpoint in the cluster. Cluster Autoscaler also has other metrics with labels, such as
# TYPE cluster_autoscaler_function_duration_seconds histogram
cluster_autoscaler_function_duration_seconds_bucket{function="filterOutSchedulable",le="0.01"} 17095
# TYPE cluster_autoscaler_nodes_count gauge
cluster_autoscaler_nodes_count{state="longUnregistered"} 0
cluster_autoscaler_nodes_count{state="notStarted"} 0
cluster_autoscaler_nodes_count{state="ready"} 15
cluster_autoscaler_nodes_count{state="unready"} 0
cluster_autoscaler_nodes_count{state="unregistered"} 0
With the following prometheus receiver config, it seems to ONLY scrape the metrics that have labels, and not cluster_autoscaler_cluster_safe_to_autoscale.
Is this a limitation of the receiver?
prometheus:
  config:
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
      - job_name: 'cluster-autoscaler'
        metrics_path: /metrics
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - action: keep
            regex: true
            source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
Can you provide some more details about your setup? Which version of the collector are you using? Can you provide the rest of your pipeline configuration?
I created a simple test for a metric with no labels and it appears to function properly, at least to the point of converting it to pdata and passing it to the next component in the pipeline.
Sure, sorry, I should have provided more originally.
This is my full ConfigMap:
prometheus:
  config:
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
      - job_name: 'kube-state-metrics'
        static_configs:
          - targets: [ 'kube-state-metrics.kube-system.svc.cluster.local:8080' ]
      - job_name: 'kubernetes-external-secrets'
        sample_limit: 10000
        metrics_path: /metrics
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - action: keep
            regex: true
            source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
          - action: keep
            regex: .*-external-secrets
            source_labels:
              - __meta_kubernetes_pod_container_name
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_node_name
            target_label: node_name
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod_name
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container_name
      - job_name: 'cluster-autoscaler'
        metrics_path: /metrics
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - action: keep
            regex: true
            source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
          - action: keep
            regex: .*-cluster-autoscaler
            source_labels:
              - __meta_kubernetes_pod_container_name
processors:
  resourcedetection/ec2:
    detectors: [ env ]
    timeout: 2s
    override: false
  resource:
    attributes:
      - key: TaskId
        from_attribute: job
        action: insert
      - key: receiver
        value: "prometheus"
        action: insert
exporters:
  awsemf:
    namespace: ContainerInsights/Prometheus
    log_group_name: "/aws/containerinsights/{ClusterName}/prometheus"
    log_stream_name: "{TaskId}"
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, deployment, namespace ], [ ClusterName, namespace ], [ ClusterName ] ]
        metric_name_selectors:
          - "^kube_deployment_status_replicas_available$"
          - "^kube_deployment_status_replicas$"
          - "^kube_pod_status_ready$"
          - "^kube_pod_status_unschedulable$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName, replicaset, namespace ], [ ClusterName, namespace ], [ ClusterName ] ]
        metric_name_selectors:
          - "^kube_replicaset_status_replicas$"
          - "^kube_replicaset_status_ready_replicas$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName, daemonset, namespace ], [ ClusterName, namespace ], [ ClusterName ] ]
        metric_name_selectors:
          - "^kube_daemonset_status_desired_number_scheduled$"
          - "^kube_daemonset_status_number_ready$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName, condition ] ]
        metric_name_selectors:
          - "^kube_node_status_condition$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName ] ]
        metric_name_selectors:
          - "^kube_node_info$"
          - "^kube_node_spec_unschedulable$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName ], [ ClusterName, name ] ]
        metric_name_selectors:
          - "^kubernetes_external_secrets_last_sync_call_state$"
        label_matchers:
          - label_names:
              - container_name
            regex: "^kubernetes-external-secrets$"
      # - dimensions: [ [ ClusterName ] ]
      #   metric_name_selectors:
      #     - "^cluster_autoscaler_cluster_safe_to_autoscale$"
      #   label_matchers:
      #     - label_names:
      #         - service.name
      #       regex: "^cluster-autoscaler$"
      # - dimensions: [ [ ClusterName, state ] ]
      #   metric_name_selectors:
      #     - "^cluster_autoscaler_nodes_count$"
      #   label_matchers:
      #     - label_names:
      #         - container_name
      #       regex: "^aws-cluster-autoscaler$"
  logging:
    loglevel: debug
extensions:
  pprof:
service:
  pipelines:
    metrics:
      receivers: [ prometheus ]
      processors: [ resourcedetection/ec2, resource ]
      exporters: [ awsemf ]
I am running this in an EKS cluster with the latest AWS OTel Collector. I followed https://aws-otel.github.io/docs/getting-started/container-insights/eks-prometheus to get set up. I am also using the "auto discovery"; it is just odd that it doesn't pick up any metrics that have no labels/dimensions.
I am going to try using a direct "target" instead of the auto discovery next.
Even after switching to use
static_configs:
  - targets: [ 'cluster-autoscaler-aws-cluster-autoscaler.kube-system.svc.cluster.local:8085' ]
the metrics still do not show up in the log group in AWS CloudWatch. It seems to only scrape/pull in metrics with labels/dimensions. Cluster Autoscaler exports a bunch of metrics that do not have any labels/dimensions on them.
I can curl the cluster autoscaler /metrics endpoint in the cluster and see all the metrics that are available for scraping
Maybe I am just doing something wrong
If you enable the logging exporter, does the cluster_autoscaler_cluster_safe_to_autoscale metric appear in the logs? Or, if you add a prometheus exporter to the pipeline, does that metric appear in the served exposition?
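(For reference, a rough sketch of what wiring those two exporters into the existing pipeline could look like; the prometheus exporter endpoint shown here is an assumed value, pick any free local port:)
exporters:
  logging:
    loglevel: debug
  prometheus:
    endpoint: "0.0.0.0:8889"   # assumed port; serves the received metrics for inspection
service:
  pipelines:
    metrics:
      receivers: [ prometheus ]
      processors: [ resourcedetection/ec2, resource ]
      exporters: [ awsemf, logging, prometheus ]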
My configmap above already has the logging exporter enabled, I thought. If I look at the ADOT collector pod, nothing really stands out to me in the logs there either. If I look at the CloudWatch log groups, nothing even shows up there. So to me the issue is on the receiver side: it doesn't even make it to the metrics because it's not in the CloudWatch logs. Is there a way to gather more debug output from the receiver section so I can get more logs on what is going on?
Oh, I was missing logging in the pipeline's exporters. I just enabled it and now there are a lot of logs in the collector pod. I see the metric I want now in the collector pod logs:
Metric #3
Descriptor:
-> Name: cluster_autoscaler_cluster_safe_to_autoscale
-> Description: [ALPHA] Whether or not cluster is healthy enough for autoscaling. 1 if it is, 0 otherwise.
-> Unit:
-> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 1.000000
It also only has one data point, compared to the metrics I do see making it to CloudWatch Logs, which have multiple data points:
Metric #41
Descriptor:
-> Name: cluster_autoscaler_nodes_count
-> Description: [ALPHA] Number of nodes in cluster.
-> Unit:
-> DataType: Gauge
NumberDataPoints #0
Data point attributes:
-> state: STRING(longUnregistered)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
-> state: STRING(notStarted)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
-> state: STRING(ready)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 15.000000
NumberDataPoints #3
Data point attributes:
-> state: STRING(unready)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000
NumberDataPoints #4
Data point attributes:
-> state: STRING(unregistered)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000
That's good, it tells us the prometheus receiver is handling this metric correctly. Having a single datapoint is expected, as each unique combination of labels and label values is translated to a datapoint, and there's only one unique combination of zero labels.
With the metric making it this far, the issue will be somewhere in the awsemf exporter. The configuration you provided seems to have commented out the section handling this metric:
# - dimensions: [ [ ClusterName ] ]
#   metric_name_selectors:
#     - "^cluster_autoscaler_cluster_safe_to_autoscale$"
#   label_matchers:
#     - label_names:
#         - service.name
#       regex: "^cluster-autoscaler$"
With that uncommented, I suspect the next issue would be the lack of a ClusterName attribute to use as a dimension. Can you add one through relabeling in the prometheus receiver or with the resource processor?
So one question I have: if you leave out the exporter config, you still get the metrics scraped into CloudWatch log groups. This is why I thought it was receiver-related, because the metric is not showing up there. The exporter is what makes it show up in CloudWatch Metrics, but if it's not in the log group you definitely cannot get it into Metrics. Is this not a correct assumption? Is the exporter also how it ends up in the log groups? That would make more sense with what you're saying.
I have that section commented out because it doesn't work, since the metric isn't in the log group.
Also, I am already injecting the ClusterName into everything via this:
processors:
  resourcedetection/ec2:
    detectors: [ env ]
    timeout: 2s
    override: false
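(The env detector reads resource attributes from the OTEL_RESOURCE_ATTRIBUTES environment variable on the collector pod, so the ClusterName presumably comes from something along these lines in the collector container spec; the value shown is a placeholder and the exact wiring used by the Container Insights install is an assumption here:)
env:
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "ClusterName=my-eks-cluster"   # placeholder; the env detector turns this into a ClusterName resource attribute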
Could my issue be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/4739
I tried adding a label here:
- job_name: 'cluster-autoscaler'
  static_configs:
    - targets: [ 'cluster-autoscaler-aws-cluster-autoscaler.kube-system.svc.cluster.local:8085' ]
      labels:
        test: 'lorelei'
and then, in the exporter, doing:
- dimensions: [ [ test ] ]
  metric_name_selectors:
    - "^cluster_autoscaler_cluster_safe_to_autoscale$"
  label_matchers:
    - label_names:
        - service.name
      regex: "^cluster-autoscaler$"
But that did not seem to work either
Could my issue be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/4739
That seems a likely candidate. I will see if we can get someone investigating that issue further.
Thank you @Aneurysm9 !
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.
A PR that will solve this issue is available in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/13766
This fix will be available in the AWS OTel Collector v0.22.0.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.