prometheus-adapter

Correct Configuration Fails to Provide Expected Custom Metrics in EKS

Open wuyudian1 opened this issue 8 months ago • 2 comments

What happened?: A correct configuration fails to provide the expected custom metrics in EKS. We deployed identical Prometheus and Prometheus-Adapter charts in both an Alibaba Cloud ACK cluster and an AWS EKS cluster. The Prometheus and Prometheus-Adapter configurations are the same in both Kubernetes clusters. The Prometheus scrape configuration is as follows:

job_name: basicai-business-queue-wait
metrics_path: /metrics/prometheus
scheme: http
scrape_interval: 30s
honor_labels: true
kubernetes_sd_configs:
  - role: service
    namespaces:
      names:
        - basicai-backend
        - basicai-stage-backend
relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    regex: dataset
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    target_label: 'kubernetes_namespace'
    action: replace
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    target_label: 'kubernetes_deployment'
    action: replace
  - source_labels: [__meta_kubernetes_service_port_number]
    regex: 80
    action: keep
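
As a rough illustration of what the two `keep` rules in the scrape configuration above do, the sketch below mimics them in Python. It is not Prometheus code: the `keeps_target` helper and the sample label sets are hypothetical. The only behavior assumed from Prometheus is that relabel regexes are fully anchored and that a missing source label is treated as an empty string.

```python
import re

# The two keep rules from the scrape config: (source label, regex).
KEEP_RULES = [
    ("__meta_kubernetes_service_label_app_kubernetes_io_component", "dataset"),
    ("__meta_kubernetes_service_port_number", "80"),
]


def keeps_target(labels: dict) -> bool:
    """Return True if a discovered target survives every keep rule."""
    for source_label, pattern in KEEP_RULES:
        # A missing label is compared as an empty string.
        value = labels.get(source_label, "")
        # fullmatch mirrors the fully anchored matching of relabel regexes.
        if re.fullmatch(pattern, value) is None:
            return False
    return True


# Hypothetical discovered targets, for illustration only.
dataset_on_80 = {
    "__meta_kubernetes_service_label_app_kubernetes_io_component": "dataset",
    "__meta_kubernetes_service_port_number": "80",
}
dataset_on_8080 = dict(dataset_on_80, __meta_kubernetes_service_port_number="8080")
unlabeled_service = {"__meta_kubernetes_service_port_number": "80"}

for name, target in [
    ("dataset:80", dataset_on_80),
    ("dataset:8080", dataset_on_8080),
    ("unlabeled:80", unlabeled_service),
]:
    print(name, "kept" if keeps_target(target) else "dropped")
```

If the Services in the EKS cluster lack the `app.kubernetes.io/component: dataset` label or do not expose port 80, they would be dropped at this stage and the metric would never reach Prometheus, so this is one place worth comparing between the two clusters.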

The values.yaml for the Prometheus-Adapter chart is as follows:

image:
  repository: registry.talos.basic.ai/common/images/prometheus-adapter
  tag: "v0.11.2"
  pullPolicy: IfNotPresent
prometheus:
  url: http://prometheus-server
  port: 80
resources:
   requests:
     cpu: 100m
     memory: 128Mi
   limits:
     cpu: 100m
     memory: 1Gi
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"basicai_job_replica_scale_percent",container!="POD",kubernetes_namespace!="",type="dataset-upload"}'
    resources:
      template: <<.Resource>>
      overrides:
        kubernetes_namespace: {resource: "namespace"}
        kubernetes_deployment: {resource: "deployment"}
    name:
      matches: "basicai_job_replica_scale_percent"
      as: "upload_job_replica_scale_percent_dataset"
    metricsQuery: last_over_time(basicai_job_replica_scale_percent{<<.LabelMatchers>>,type="dataset-upload"}[5m])
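
To make the rule above easier to check by hand, the following sketch expands the `metricsQuery` template for one deployment. It is only an approximation of what the adapter does: the `render_query` helper is hypothetical, the namespace and deployment values are examples taken from the scrape configuration, and the exact order and format of the matchers the adapter generates may differ.

```python
# Template copied from the rule's metricsQuery.
METRICS_QUERY = (
    "last_over_time(basicai_job_replica_scale_percent"
    '{<<.LabelMatchers>>,type="dataset-upload"}[5m])'
)

# Resource -> Prometheus label, taken from the rule's `overrides` section.
RESOURCE_LABELS = {
    "namespace": "kubernetes_namespace",
    "deployment": "kubernetes_deployment",
}


def render_query(namespace: str, deployment: str) -> str:
    """Substitute label matchers for one deployment into the template."""
    matchers = ",".join(
        [
            f'{RESOURCE_LABELS["namespace"]}="{namespace}"',
            f'{RESOURCE_LABELS["deployment"]}="{deployment}"',
        ]
    )
    return METRICS_QUERY.replace("<<.LabelMatchers>>", matchers)


# Example values (assumed): namespace and component from the scrape config.
print(render_query("basicai-backend", "dataset"))
```

Running the printed query directly against Prometheus in each cluster is one way to see whether the series exists there with the `kubernetes_namespace` and `kubernetes_deployment` labels the rule depends on.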

In the Alibaba Cloud ACK cluster, the Prometheus-Adapter correctly provides custom metrics:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "deployments.apps/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "jobs.batch/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

However, in the EKS cluster, Prometheus-Adapter exposes a large number of default metrics but not the expected 'upload_job_replica_scale_percent_dataset':

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | head -n 50
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "services/authentication_duration_seconds_sum",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    .....
    .....
    .....
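
Because the EKS output above is cut off by `head -n 50`, it can help to search the full response for the metric name instead of reading it by eye. The sketch below is one way to do that, assuming the response has been saved as JSON. The `find_metric` helper is hypothetical, and the sample document is a trimmed copy of the ACK response shown earlier.

```python
import json


def find_metric(api_resource_list: str, metric: str) -> list:
    """Return the resource names in an APIResourceList that expose `metric`."""
    doc = json.loads(api_resource_list)
    return [
        resource["name"]
        for resource in doc.get("resources", [])
        # Names look like "<resource>/<metric>"; compare the metric part only.
        if resource["name"].split("/", 1)[-1] == metric
    ]


# Trimmed copy of the ACK response, used here as sample input.
sample = json.dumps(
    {
        "kind": "APIResourceList",
        "apiVersion": "v1",
        "groupVersion": "custom.metrics.k8s.io/v1beta1",
        "resources": [
            {"name": "deployments.apps/upload_job_replica_scale_percent_dataset"},
            {"name": "namespaces/upload_job_replica_scale_percent_dataset"},
            {"name": "services/authentication_duration_seconds_sum"},
        ],
    }
)

matches = find_metric(sample, "upload_job_replica_scale_percent_dataset")
print(matches if matches else "metric not exposed")
```

To use it against a real cluster, the output of `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1` can be saved to a file and its contents passed in as the first argument.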

What did you expect to happen?: Prometheus-Adapter should provide the expected custom metrics in the AWS EKS cluster, as it does in the Alibaba Cloud ACK cluster.

Please provide the prometheus-adapter config:

(Identical to the Prometheus-Adapter values.yaml shown above.)

wuyudian1, Jun 04 '24 09:06