aws-otel-community

Metrics Path and Port don't change according to pod annotations and metrics without a type get ignored completely


Describe the bugs

  • The collector doesn't check the metrics path or port annotated on the pods
  • Not all Spark executor metrics can be collected (these metrics have no type assigned to them)

Steps to reproduce Try to scrape metrics using the Prometheus receiver defined under:

data:
  adot-collector-config: |
    receivers:
      prometheus:

These scrape configs won't work:

- action: keep
  regex: true
  source_labels:
  - __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
  regex: (.+)
  source_labels:
  - __meta_kubernetes_service_annotation_prometheus_io_path
  target_label: __metrics_path__
- action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2
  source_labels:
  - __address__
  - __meta_kubernetes_service_annotation_prometheus_io_port
  target_label: __address__
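
For reference, these relabel rules read the standard prometheus.io annotations on the discovered Service (or Pod) objects; the __meta_kubernetes_service_annotation_prometheus_io_* labels map directly to them. A minimal sketch of the annotations the rules are supposed to honor (the Service name here is hypothetical, and the path and port values mirror the Spark executor setup described further down):

apiVersion: v1
kind: Service
metadata:
  name: spark-executor-svc          # hypothetical name, for illustration only
  annotations:
    prometheus.io/scrape: "true"    # matched by the keep rule above
    prometheus.io/path: "/metrics/executors/prometheus"   # should become __metrics_path__
    prometheus.io/port: "4040"      # should be joined into __address__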

What did you expect to see? I expected ADOT to collect the metrics from the metrics path and port annotated on the pod.

What did you see instead? My pods had both the metrics path and the port to collect from in their annotations. I saw in the debug logs that the collector never checked either the port or the metrics path specified there.

Workaround I had to manually set the metrics path in each scrape job of the configuration file below, like so:

metrics_path: "/metrics/executors/prometheus"

And under the Service I had to specify the port:

spec:
  ports:
  - name: executor-metrics 
    port: 4040

Additional Information (The Second Issue) After I finally set the metrics path and port explicitly, I noticed the collector still wasn't collecting all the Spark job metrics. This is because those metrics have no type defined, so ADOT just drops them and they all get ignored. Only the metrics whose names end in the _total suffix were collected; all the others were ignored. I didn't find a workaround for this second issue and wasn't able to collect all the Spark metrics.
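
For context on why the missing type matters: the Prometheus text exposition format normally declares each metric's type in a # TYPE comment, and that metadata is what the receiver uses to build typed metrics. Spark's PrometheusServlet output (see the sample at the bottom of this issue) carries no # TYPE comments at all, which presumably lets the receiver infer a counter only from the _total suffix and drop everything else. A rough illustration, reusing one metric name from the sample below with the labels trimmed:

# What a typed exposition would look like (the # TYPE line is the part Spark does not emit):
# TYPE metrics_executor_memoryUsed_bytes gauge
metrics_executor_memoryUsed_bytes{executor_id="driver"} 1816

# What Spark actually exposes: the same sample with no type metadata, so it gets dropped
metrics_executor_memoryUsed_bytes{executor_id="driver"} 1816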

Environment AWS EKS Cluster

Full Configuration Used I basically just do kubectl apply -f filename.yml with the configuration file below:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-conf
  namespace: adot-col
  labels:
    app: aws-adot
    component: adot-collector-conf
data:
  adot-collector-config: |
    receivers:
      prometheus:
        config:
          global:
            evaluation_interval: 1m
            scrape_interval: 1m
            scrape_timeout: 30s
            

          scrape_configs:
          - job_name: 'kubernetes-service-endpoints-spark'
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            metrics_path: "/metrics/executors/prometheus"
            kubernetes_sd_configs:
            - role: endpoints
            relabel_configs:
            - action: keep
              regex: true
              source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            - action: replace
              regex: (.+)
              source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels: [__address__,__meta_kubernetes_service_annotation_prometheus_io_port]
              target_label: __address__

          - job_name: 'kubernetes-executor-pods-spark'
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            metrics_path: "/metrics/executors/prometheus"
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: kubernetes_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: kubernetes_pod_name

          - job_name: 'kubernetes-executor-pods-spark-slow'
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            metrics_path: "/metrics/executors/prometheus"
            scrape_interval: 2m
            scrape_timeout: 40s
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: kubernetes_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: kubernetes_pod_name
          
          - job_name: 'kubernetes-driver-pods-spark'
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            metrics_path: "/metrics/driver/prometheus" # spark.metrics.conf.*.sink.prometheusServlet.path
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - source_labels: [__meta_kubernetes_namespace]
              action: replace
              target_label: kubernetes_namespace
            - source_labels: [__meta_kubernetes_pod_name]
              action: replace
              target_label: kubernetes_pod_name
            

    exporters:
      awsprometheusremotewrite:
        # replace this with your endpoint
        endpoint: <removing this for privacy>
        # replace this with your region
        aws_auth:
          region: <removing this for privacy>
          service: "aps"
        namespace: "adot"
      logging:
        loglevel: debug

    extensions:
      health_check:
        endpoint: :13133
      pprof:
        endpoint: :1777
      zpages:
        endpoint: :55679

    service:
      extensions: [pprof, zpages, health_check]
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [logging, awsprometheusremotewrite]
---
# create adot-col service account and role binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: amp-iamproxy-ingest-service-account
  namespace: adot-col
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<removing this for privacy>:role/amp-iamproxy-ingest-role

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
    - nodes
    - nodes/proxy
    - services
    - endpoints
    - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
    - extensions
    resources:
    - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role-binding
subjects:
  - kind: ServiceAccount
    name: amp-iamproxy-ingest-service-account
    namespace: adot-col
roleRef:
  kind: ClusterRole
  name: adotcol-admin-role
  apiGroup: rbac.authorization.k8s.io

---
apiVersion: v1
kind: Service
metadata:
  name: adot-collector
  namespace: adot-col
  labels:
    app: aws-adot
    component: adot-collector
spec:
  ports:
  - name: executor-metrics 
    port: 4040
  - name: metrics # Default endpoint for querying metrics.
    port: 8888
  selector:
    component: adot-collector
  type: NodePort
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: adot-collector
  namespace: adot-col
  labels:
    app: aws-adot
    component: adot-collector
spec:
  selector:
    matchLabels:
      app: aws-adot
      component: adot-collector
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: aws-adot
        component: adot-collector
    spec:
      serviceAccountName: amp-iamproxy-ingest-service-account
      containers:
      - command:
          - "/awscollector"
          - "--config=/conf/adot-collector-config.yaml"
        image: public.ecr.aws/aws-observability/aws-otel-collector:latest
        name: adot-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports:
        - containerPort: 8888  # Default endpoint for querying metrics.
        volumeMounts:
        - name: adot-collector-config-vol
          mountPath: /conf
        livenessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension port.
        readinessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension port.
      volumes:
        - configMap:
            name: adot-collector-conf
            items:
              - key: adot-collector-config
                path: adot-collector-config.yaml
          name: adot-collector-config-vol
---

Example of metrics I'm trying to scrape Only the metrics that have the _total suffix get scraped (for example metrics_executor_failedTasks_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"}); all the others get ignored.

spark_info{version="3.1.1", revision=""} 1.0
metrics_executor_rddBlocks{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_memoryUsed_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 1816
metrics_executor_diskUsed_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_totalCores{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_maxTasks{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_activeTasks{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_failedTasks_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_completedTasks_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_totalTasks_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_totalDuration_seconds_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0.0
metrics_executor_totalGCTime_seconds_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0.0
metrics_executor_totalInputBytes_bytes_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_totalShuffleRead_bytes_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_totalShuffleWrite_bytes_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_maxMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 1078827417
metrics_executor_usedOnHeapStorageMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 1816
metrics_executor_usedOffHeapStorageMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_totalOnHeapStorageMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 1078827417
metrics_executor_totalOffHeapStorageMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_JVMHeapMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 1224267120
metrics_executor_JVMOffHeapMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 120520560
metrics_executor_OnHeapExecutionMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_OffHeapExecutionMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_OnHeapStorageMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 5000
metrics_executor_OffHeapStorageMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_OnHeapUnifiedMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 5000
metrics_executor_OffHeapUnifiedMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_DirectPoolMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 320589
metrics_executor_MappedPoolMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_ProcessTreeJVMVMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 5970055168
metrics_executor_ProcessTreeJVMRSSMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 1603010560
metrics_executor_ProcessTreePythonVMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_ProcessTreePythonRSSMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_ProcessTreeOtherVMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_ProcessTreeOtherRSSMemory_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0
metrics_executor_MinorGCCount_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 18366
metrics_executor_MajorGCCount_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 18
metrics_executor_MinorGCTime_seconds_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 147.54
metrics_executor_MajorGCTime_seconds_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="driver"} 0.45
metrics_executor_rddBlocks{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 0
metrics_executor_memoryUsed_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 1816
metrics_executor_diskUsed_bytes{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 0
metrics_executor_totalCores{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 2
metrics_executor_maxTasks{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 2
metrics_executor_activeTasks{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 2
metrics_executor_failedTasks_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 0
metrics_executor_completedTasks_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 450466
metrics_executor_totalTasks_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 450468
metrics_executor_totalDuration_seconds_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 2356.288
metrics_executor_totalGCTime_seconds_total{application_id="spark-af748c06812c416e946c4aedfd4b4c4b", application_name="Spark Pi", executor_id="2"} 11.77
...

@alolita

Gaby-Soto · Oct 05 '21