prometheus-adapter
Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
What happened?: I am constantly hitting this with the HPA of one of my services, which is configured to scale based on custom metrics. Sometimes the HPA reports AbleToScale as True and is able to get the custom metric, but most of the time it is not. Because of that, the HPA is not able to scale the pods down.
This is the HPA description for the affected service:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetPodsMetric the HPA was unable to compute the replica count: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
The other service, which uses the same HPA configuration, does not show this error when its HPA is described. This is the HPA description from that service.
Working service HPA description (note: the behaviour is random; in both services the HPA is sometimes able to collect the custom metric and sometimes not):
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric DCGM_FI_DEV_FB_USED_AVG
What did you expect to happen?: I expect prometheus-adapter and the HPA to behave the same across both services, since both use the same configuration.
Please provide the prometheus-adapter config:
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local
  port: 9090

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

rules:
  default: false
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_FB_USED{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_FB_USED_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_FB_USED{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MIN"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: min by (exported_namespace, exported_container) (round(min_over_time(<<.Series>>[1m])))
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MAX"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: max by (exported_namespace, exported_container) (round(max_over_time(<<.Series>>[1m])))
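To isolate whether the gap is on the Prometheus side or on the adapter side, it can help to run the rule's metricsQuery directly against the Prometheus HTTP API. A minimal sketch, assuming it is run from a pod inside the cluster that can reach the Prometheus Service configured above (the URL is taken from the config; the jq filter is only illustrative):

# Run the exact metricsQuery behind DCGM_FI_DEV_FB_USED_AVG against Prometheus and
# check that a sample comes back for the pods of the affected deployment.
curl -sG 'http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query' \
  --data-urlencode 'query=avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_FB_USED{exported_pod!="",exported_container!=""}[1m])))' \
  | jq '.data.result[] | {ns: .metric.exported_namespace, pod: .metric.exported_pod, value: .value[1]}'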
When checking whether the metric exists, I got this response:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_FB_USED_AVG
"name": "pods/DCGM_FI_DEV_FB_USED_AVG",
"name": "namespaces/DCGM_FI_DEV_FB_USED_AVG",
Please provide the HPA resource used for autoscaling:
The HPA YAML for both services is here.
Not working one:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceA-memory-utilization-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: serviceA
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_FB_USED_AVG
        target:
          type: AverageValue
          averageValue: 20000
Working one:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceB-memory-utilization-hpa
  namespace: common-service-development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: serviceB
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_FB_USED_AVG
        target:
          type: AverageValue
          averageValue: 20000
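Since the HPA controller requests the metric only for pods matching the target Deployment's label selector, it is worth repeating the raw query with that selector. A sketch, where app=serviceA is a hypothetical label; substitute the Deployment's actual selector:

# Query the custom metric the way the HPA controller does: scoped to the target
# Deployment's pod label selector (app=serviceA is only an example label).
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/development/pods/*/DCGM_FI_DEV_FB_USED_AVG?labelSelector=app%3DserviceA" \
  | jq '.items | length'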
Please provide the HPA status:
We observe these events in both services from time to time; the HPA is sometimes able to collect the metric for serviceB, but most of the time not for serviceA.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedComputeMetricsReplicas 26m (x12 over 30m) horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
Warning FailedGetPodsMetric 22s (x74 over 30m) horizontal-pod-autoscaler unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
And this is the HPA status: it appears able to read the memory utilization value, but when we describe the HPA we observe the issues stated earlier, where the HPA can neither collect the metric nor trigger any scaling activity.
NAME                              REFERENCE             TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
serviceA-memory-utilization-hpa   Deployment/serviceA   19675/20k   1         2         1          14m
serviceB-memory-utilization-hpa   Deployment/serviceB   19675/20k   1         2         2          11m
Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:
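No adapter logs were captured yet. Something like the following could collect them, assuming (hypothetically) the adapter runs as a Deployment named prometheus-adapter in the monitoring namespace and was started with --v=6:

# Grep the adapter's verbose logs for the affected metric around the failure window.
kubectl -n monitoring logs deploy/prometheus-adapter --since=30m \
  | grep -i DCGM_FI_DEV_FB_USED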
Anything else we need to know?:
Environment:
- prometheus-adapter version: prometheus-adapter-3.2.2 v0.9.1
- prometheus version: kube-prometheus-stack-56.6.2 v0.71.2
- Kubernetes version (use kubectl version): Client Version: v1.28.3-eks-e71965b, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.26.12-eks-5e0fdde
- Cloud provider or hardware configuration: AWS EKS
- Other info:
/cc @CatherineF-dev /assign @dgrisonnet /triage accepted
Starting from v0.11.0, the file corresponding to this link does not exist, which may be the cause
That is not the only issue: prometheus-adapter fails to get the GPU metrics, so it cannot scale the Kubernetes Deployment up or down, and the HPA describe output reports "Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API", with the target shown as unknown or ScalingActive flipping to False.
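When the custom metrics API intermittently returns nothing, one quick check is whether the APIService that prometheus-adapter registers stays Available; a sketch, assuming the standard v1beta1.custom.metrics.k8s.io registration:

# Confirm the custom metrics APIService is registered and Available; a False or
# flapping condition here points at the adapter/apiserver link rather than at the rules.
kubectl get apiservice v1beta1.custom.metrics.k8s.io -o wide
kubectl get apiservice v1beta1.custom.metrics.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'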
Just curious, what does the raw data for DCGM_FI_DEV_GPU_UTIL{} from Prometheus look like?
Something like this, @dvp34:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "DCGM_FI_DRIVER_VERSION": "535.171.04",
          "Hostname": "qxzg-l4server",
          "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
          "__name__": "DCGM_FI_DEV_GPU_UTIL",
          "container": "nvidia-dcgm-exporter",
          "device": "nvidia1",
          "endpoint": "gpu-metrics",
          "exported_container": "triton",
          "exported_namespace": "llm",
          "exported_pod": "qwen-1gpu-75455d6c96-7jcxq",
          "gpu": "1",
          "instance": "10.42.0.213:9400",
          "job": "nvidia-dcgm-exporter",
          "modelName": "NVIDIA L4",
          "namespace": "gpu-operator",
          "pod": "nvidia-dcgm-exporter-rlhcx",
          "service": "nvidia-dcgm-exporter"
        },
        "value": [
          1719909159.405,
          "0"
        ]
      },
      {
        "metric": {
          "DCGM_FI_DRIVER_VERSION": "535.171.04",
          "Hostname": "qxzg-l4server",
          "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
          "__name__": "DCGM_FI_DEV_GPU_UTIL",
          "container": "triton",
          "device": "nvidia1",
          "gpu": "1",
          "instance": "10.42.0.213:9400",
          "job": "gpu-metrics",
          "kubernetes_node": "qxzg-l4server",
          "modelName": "NVIDIA L4",
          "namespace": "llm",
          "pod": "qwen-1gpu-75455d6c96-7jcxq"
        },
        "value": [
          1719909159.405,
          "0"
        ]
      },
      {
        "metric": {
          "DCGM_FI_DRIVER_VERSION": "535.171.04",
          "Hostname": "qxzg-l4server",
          "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
          "__name__": "DCGM_FI_DEV_GPU_UTIL",
          "container": "nvidia-dcgm-exporter",
          "device": "nvidia0",
          "endpoint": "gpu-metrics",
          "exported_container": "triton",
          "exported_namespace": "llm",
          "exported_pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb",
          "gpu": "0",
          "instance": "10.42.0.213:9400",
          "job": "nvidia-dcgm-exporter",
          "modelName": "NVIDIA L4",
          "namespace": "gpu-operator",
          "pod": "nvidia-dcgm-exporter-rlhcx",
          "service": "nvidia-dcgm-exporter"
        },
        "value": [
          1719909159.405,
          "0"
        ]
      },
      {
        "metric": {
          "DCGM_FI_DRIVER_VERSION": "535.171.04",
          "Hostname": "qxzg-l4server",
          "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
          "__name__": "DCGM_FI_DEV_GPU_UTIL",
          "container": "triton",
          "device": "nvidia0",
          "gpu": "0",
          "instance": "10.42.0.213:9400",
          "job": "gpu-metrics",
          "kubernetes_node": "qxzg-l4server",
          "modelName": "NVIDIA L4",
          "namespace": "llm",
          "pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb"
        },
        "value": [
          1719909159.405,
          "0"
        ]
      }
    ]
  }
}
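The dump above shows each GPU reported twice: once under the nvidia-dcgm-exporter job, where the workload is identified via the exported_* labels, and once under a gpu-metrics job with plain namespace/pod labels. To see at a glance which jobs produce the series the adapter rules can actually match (the rules select on exported_namespace, exported_pod, and exported_container), a quick sketch against the Prometheus query API, with the same in-cluster URL assumption as earlier:

# Count DCGM_FI_DEV_GPU_UTIL series per scrape job, with and without the exported_* labels.
curl -sG 'http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query' \
  --data-urlencode 'query=count by (job) (DCGM_FI_DEV_GPU_UTIL)' | jq '.data.result'
curl -sG 'http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query' \
  --data-urlencode 'query=count by (job) (DCGM_FI_DEV_GPU_UTIL{exported_pod!=""})' | jq '.data.result'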