aws-otel-collector

`awsemf` exporter does not export metrics with no labels

Open lorelei-rupp-imprivata opened this issue 2 years ago • 16 comments

Describe the bug

Cluster Autoscaler has metrics on the /metrics endpoint such as

# TYPE cluster_autoscaler_cluster_safe_to_autoscale gauge
cluster_autoscaler_cluster_safe_to_autoscale 1

This can be seen when you hit the /metrics endpoint in the cluster. Cluster Autoscaler also has other metrics with labels, such as

# TYPE cluster_autoscaler_function_duration_seconds histogram
cluster_autoscaler_function_duration_seconds_bucket{function="filterOutSchedulable",le="0.01"} 17095
# TYPE cluster_autoscaler_nodes_count gauge
cluster_autoscaler_nodes_count{state="longUnregistered"} 0
cluster_autoscaler_nodes_count{state="notStarted"} 0
cluster_autoscaler_nodes_count{state="ready"} 15
cluster_autoscaler_nodes_count{state="unready"} 0
cluster_autoscaler_nodes_count{state="unregistered"} 0

With the following prometheus receiver config it seems to ONLY scrape the metrics with labels, and not cluster_autoscaler_cluster_safe_to_autoscale.

Is this a limitation of the receiver?

  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
        - job_name: 'cluster-autoscaler'
          metrics_path: /metrics
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: keep
              regex: true
              source_labels:
                - __meta_kubernetes_pod_annotation_prometheus_io_scrape

lorelei-rupp-imprivata avatar Mar 05 '22 14:03 lorelei-rupp-imprivata

Can you provide some more details about your setup? Which version of the collector are you using? Can you provide the rest of your pipeline configuration?

I created a simple test for a metric with no labels and it appears to function properly, at least to the point of converting it to pdata and passing it to the next component in the pipeline.


Aneurysm9 avatar Mar 06 '22 18:03 Aneurysm9

Sure, sorry, I should have provided more detail originally.

This is my full ConfigMap:

  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
        - job_name: 'kube-state-metrics'
          static_configs:
            - targets: [ 'kube-state-metrics.kube-system.svc.cluster.local:8080' ]
        - job_name: 'kubernetes-external-secrets'
          sample_limit: 10000
          metrics_path: /metrics
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: keep
              regex: true
              source_labels:
                - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: keep
              regex: .*-external-secrets
              source_labels:
                - __meta_kubernetes_pod_container_name
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_node_name
              target_label: node_name
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_name
              target_label: pod_name
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_container_name
              target_label: container_name
        - job_name: 'cluster-autoscaler'
          metrics_path: /metrics
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: keep
              regex: true
              source_labels:
                - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: keep
              regex: .*-cluster-autoscaler
              source_labels:
                - __meta_kubernetes_pod_container_name


processors:
  resourcedetection/ec2:
    detectors: [ env ]
    timeout: 2s
    override: false
  resource:
    attributes:
      - key: TaskId
        from_attribute: job
        action: insert
      - key: receiver
        value: "prometheus"
        action: insert

exporters:
  awsemf:
    namespace: ContainerInsights/Prometheus
    log_group_name: "/aws/containerinsights/{ClusterName}/prometheus"
    log_stream_name: "{TaskId}"
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, deployment, namespace ], [ ClusterName, namespace ], [ ClusterName ]]
        metric_name_selectors:
          - "^kube_deployment_status_replicas_available$"
          - "^kube_deployment_status_replicas$"
          - "^kube_pod_status_ready$"
          - "^kube_pod_status_unschedulable$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName, replicaset, namespace ], [ ClusterName, namespace ], [ ClusterName ]]
        metric_name_selectors:
          - "^kube_replicaset_status_replicas$"
          - "^kube_replicaset_status_ready_replicas$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName, daemonset, namespace ], [ ClusterName, namespace ], [ ClusterName ]]
        metric_name_selectors:
          - "^kube_daemonset_status_desired_number_scheduled$"
          - "^kube_daemonset_status_number_ready$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName, condition ] ]
        metric_name_selectors:
          - "^kube_node_status_condition$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName ] ]
        metric_name_selectors:
          - "^kube_node_info$"
          - "^kube_node_spec_unschedulable$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^kube-state-metrics$"
      - dimensions: [ [ ClusterName ], [ClusterName, name ]]
        metric_name_selectors:
          - "^kubernetes_external_secrets_last_sync_call_state$"
        label_matchers:
          - label_names:
              - container_name
            regex: "^kubernetes-external-secrets$"
#      - dimensions: [ [ ClusterName ] ]
#        metric_name_selectors:
#          - "^cluster_autoscaler_cluster_safe_to_autoscale$"
#        label_matchers:
#          - label_names:
#              - service.name
#            regex: "^cluster-autoscaler$"
#      - dimensions: [ [ ClusterName, state ] ]
#        metric_name_selectors:
#          - "^cluster_autoscaler_nodes_count$"
#        label_matchers:
#          - label_names:
#              - container_name
#            regex: "^aws-cluster-autoscaler$"

  logging:
    loglevel: debug

extensions:
  pprof:

service:
  pipelines:
    metrics:
      receivers: [ prometheus ]
      processors: [ resourcedetection/ec2, resource ]
      exporters: [ awsemf ]

I am running this in an EKS cluster with the latest AWS OTel Collector. I followed https://aws-otel.github.io/docs/getting-started/container-insights/eks-prometheus to get set up. I am also using the "auto discovery"; it is just odd that it doesn't pick up any metrics that have no labels/dimensions.

I am going to try using a direct "target" instead of the auto discovery next.

lorelei-rupp-imprivata avatar Mar 06 '22 19:03 lorelei-rupp-imprivata

Even after switching to use

          static_configs:
            - targets: [ 'cluster-autoscaler-aws-cluster-autoscaler.kube-system.svc.cluster.local:8085' ]

They still do not show up in the log group in AWS CloudWatch. It seems to only scrape/pull in metrics with labels/dimensions. Cluster Autoscaler exports a bunch of metrics that do not have labels/dimensions on them.

I can curl the cluster autoscaler /metrics endpoint in the cluster and see all the metrics that are available for scraping.

Maybe I am just doing something wrong

lorelei-rupp-imprivata avatar Mar 06 '22 19:03 lorelei-rupp-imprivata

If you enable the logging exporter does the cluster_autoscaler_cluster_safe_to_autoscale metric appear in the logs? Or, if you add a prometheus exporter to the pipeline does that metric appear in the served exposition?
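
For example, a sketch of the service pipeline with the logging exporter you already declare added to it:

service:
  pipelines:
    metrics:
      receivers: [ prometheus ]
      processors: [ resourcedetection/ec2, resource ]
      exporters: [ awsemf, logging ]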

Aneurysm9 avatar Mar 06 '22 19:03 Aneurysm9

If you enable the logging exporter does the cluster_autoscaler_cluster_safe_to_autoscale metric appear in the logs? Or, if you add a prometheus exporter to the pipeline does that metric appear in the served exposition?

I thought my configmap above already had the logging exporter enabled. If I look at the ADOT collector pod, nothing really stands out to me in its logs either. If I look at the CloudWatch log groups, nothing even shows up there. So to me the issue is on the receiver side; it doesn't even make it to the metrics because it's not in the CloudWatch logs. Is there a way to gather more debug output from the receiver section so I can get more logs about what is going on?

lorelei-rupp-imprivata avatar Mar 06 '22 19:03 lorelei-rupp-imprivata

Oh, I was missing the logging exporter in the pipeline exporters. I just enabled it and now there are a lot of entries in the collector pod logs. I can now see the metric I want in the collector pod logs:

Metric #3
Descriptor:
     -> Name: cluster_autoscaler_cluster_safe_to_autoscale
     -> Description: [ALPHA] Whether or not cluster is healthy enough for autoscaling. 1 if it is, 0 otherwise.
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 1.000000

lorelei-rupp-imprivata avatar Mar 06 '22 19:03 lorelei-rupp-imprivata

It also only has one data point, whereas the metrics that I do see make it to the CloudWatch logs have multiple data points:

Metric #41
Descriptor:
     -> Name: cluster_autoscaler_nodes_count
     -> Description: [ALPHA] Number of nodes in cluster.
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> state: STRING(longUnregistered)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> state: STRING(notStarted)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> state: STRING(ready)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 15.000000
NumberDataPoints #3
Data point attributes:
     -> state: STRING(unready)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000
NumberDataPoints #4
Data point attributes:
     -> state: STRING(unregistered)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-06 19:55:58.023 +0000 UTC
Value: 0.000000

lorelei-rupp-imprivata avatar Mar 06 '22 20:03 lorelei-rupp-imprivata

That's good, it tells us the prometheus receiver is handling this metric correctly. Having a single datapoint is expected as each unique combination of labels and label values is translated to a datapoint and there's only one unique combination of zero labels.

With the metric making it this far, the issue will be somewhere in the awsemf exporter. The configuration you provided seems to have commented out the section handling this metric:

#      - dimensions: [ [ ClusterName ] ]
#        metric_name_selectors:
#          - "^cluster_autoscaler_cluster_safe_to_autoscale$"
#        label_matchers:
#          - label_names:
#              - service.name
#            regex: "^cluster-autoscaler$"

With that uncommented I suspect the next issue would be the lack of a ClusterName attribute to use as a dimension. Can you add one through relabeling in the prometheus receiver or with the resource processor?
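
For example, a minimal sketch using the resource processor (the cluster name value here is just a placeholder):

processors:
  resource:
    attributes:
      - key: ClusterName
        value: "my-eks-cluster"   # placeholder; substitute your actual cluster name
        action: insert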

Aneurysm9 avatar Mar 07 '22 00:03 Aneurysm9

That's good, it tells us the prometheus receiver is handling this metric correctly. Having a single datapoint is expected as each unique combination of labels and label values is translated to a datapoint and there's only one unique combination of zero labels.

With the metric making it this far, the issue will be somewhere in the awsemf exporter. The configuration you provided seems to have commented out the section handling this metric:

#      - dimensions: [ [ ClusterName ] ]
#        metric_name_selectors:
#          - "^cluster_autoscaler_cluster_safe_to_autoscale$"
#        label_matchers:
#          - label_names:
#              - service.name
#            regex: "^cluster-autoscaler$"

With that uncommented I suspect the next issue would be the lack of a ClusterName attribute to use as a dimension. Can you add one through relabeling in the prometheus receiver or with the resource processor?

So one question I have: if you leave out the exporter config, you still get the metrics scraped into CloudWatch log groups. This is why I thought it was receiver related, because it is not showing up there. The exporter is what makes it show up in CloudWatch Metrics, but if it's not in the log group you definitely cannot get it into the metrics. Is this not a correct assumption? Is the exporter also how it ends up in the log groups? That would make more sense with what you're saying.

I have that stuff commented out because it doesn't work, since the metric isn't in the log group.

Also, I am already injecting the ClusterName into everything via this:

processors:
  resourcedetection/ec2:
    detectors: [ env ]
    timeout: 2s
    override: false
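
For context, the env detector reads resource attributes from the OTEL_RESOURCE_ATTRIBUTES environment variable on the collector container, so the ClusterName presumably comes from something like this (the value is a placeholder):

        env:
          - name: OTEL_RESOURCE_ATTRIBUTES
            value: "ClusterName=my-eks-cluster"   # placeholder cluster name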

lorelei-rupp-imprivata avatar Mar 07 '22 13:03 lorelei-rupp-imprivata

Could my issue be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/4739

lorelei-rupp-imprivata avatar Mar 07 '22 13:03 lorelei-rupp-imprivata

I tried adding a label here

        - job_name: 'cluster-autoscaler'
          static_configs:
            - targets: [ 'cluster-autoscaler-aws-cluster-autoscaler.kube-system.svc.cluster.local:8085' ]
              labels:
                test: 'lorelei'

and then in the exporter doing

      - dimensions: [ [ test ] ]
        metric_name_selectors:
          - "^cluster_autoscaler_cluster_safe_to_autoscale$"
        label_matchers:
          - label_names:
              - service.name
            regex: "^cluster-autoscaler$"

But that did not seem to work either.

lorelei-rupp-imprivata avatar Mar 07 '22 15:03 lorelei-rupp-imprivata

Could my issue be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/4739

That seems a likely candidate. I will see if we can get someone investigating that issue further.

Aneurysm9 avatar Mar 07 '22 18:03 Aneurysm9

Thank you @Aneurysm9!

lorelei-rupp-imprivata avatar Mar 07 '22 18:03 lorelei-rupp-imprivata

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] avatar Jul 17 '22 20:07 github-actions[bot]

A PR that will solve this issue is available in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/13766

rapphil avatar Aug 31 '22 23:08 rapphil

This fix will be available in AWS OTel Collector v0.22.0.

rapphil avatar Sep 13 '22 17:09 rapphil

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] avatar Nov 13 '22 20:11 github-actions[bot]