EHPA reports an error when the service has a large number of pods
When the service has a large number of pods (my service has 150 pods), the get ehpa command shows the error below; with fewer pods there is no problem:
the HPA was unable to compute the replica count: unable to get metric tensorflow_serving_latency_999: unable to fetch metrics from custom metrics API: Internal error occurred: unable to fetch metrics
prometheus-adapter logs the following:

E0717 06:07:04.633224 1 provider.go:150] unable to fetch metrics from prometheus: bad_response: unknown response code 414
I0717 06:07:04.633771 1 httplog.go:132] "HTTP" verb="GET" URI="/apis/custom.metrics.k8s.io/v1beta1/namespaces/qke-generic-jarvis-cupid-algo/pods/%2A/tensorflow_serving_latency_999?labelSelector=name%3Djarvis-ads-algo-cpx-e2-episode-pcvr-26035-qpaas-hslf" latency="339.610784ms" userAgent="kube-controller-manager/v1.24.15 (linux/amd64) kubernetes/887f5c3/system:serviceaccount:kube-system:horizontal-pod-autoscaler" audit-ID="08f823f8-afd8-43e0-b44c-fdc09a64b612" srcIP="10.188.121.103:58614" resp=500 statusStack=<
goroutine 1930959 [running]:
k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc001253a20, 0xc06407afc0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/server/httplog/httplog.go:320 +0x105
k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc001253a20, 0xc0885aedc0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/server/httplog/httplog.go:300 +0x25
k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc06407aff0, 0x9100000000000010?)
/go/pkg/mod/k8s.io/[email protected]/pkg/server/filters/timeout.go:239 +0x1c8
k8s.io/apiserver/pkg/endpoints/metrics.(*ResponseWriterDelegator).WriteHeader(0x1f559e0?, 0xc0885aedc0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/endpoints/metrics/metrics.go:737 +0x29
k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.(*deferredResponseWriter).Write(0xc0b112a120, {0xc00f028000, 0x99, 0x9f})
/go/pkg/mod/k8s.io/[email protected]/pkg/endpoints/handlers/responsewriters/writers.go:243 +0x642
k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).doEncode(0xc000a19a00, {0x27335e8?, 0xc0015ac320?}, {0x272b1e0, 0xc0b112a120}, {0x272adc0?, 0x392f298?})
/go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/protobuf/protobuf.go:228 +0x5b9
k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).encode(0xc000a19a00, {0x27335e8, 0xc0015ac320}, {0x272b1e0, 0xc0b112a120}, {0x272adc0?, 0x392f298?})
/go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/protobuf/protobuf.go:181 +0x13d
k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).Encode(0x0?, {0x27335e8?, 0xc0015ac320?}, {0x272b1e0?, 0xc0b112a120?})
/go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/protobuf/protobuf.go:174 +0x3b
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).doEncode(0xc0015ac3c0, {0x27335e8, 0xc0015ac320}, {0x272b1e0, 0xc0b112a120}, {0x0?, 0x0?})
/go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/versioning/versioning.go:268 +0xc05
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).encode(0xc0015ac3c0, {0x27335e8, 0xc0015ac320}, {0x272b1e0, 0xc0b112a120}, {0x0?, 0x0?})
/go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/versioning/versioning.go:214 +0x167
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Encode(0x274ae08?, {0x27335e8?, 0xc0015ac320?}, {0x272b1e0?, 0xc0b112a120?})
The prometheus query configured in the EHPA is:

annotations:
  metric-query.autoscaling.crane.io/services.tensorflow_serving_latency_999: avg(tensorflow_serving_latency_999{namespace="namespace",pod~="abcd."})

But prometheus-adapter reports the following API error because the URI is too long:

GET http://..../api/v1/query?query=sum%28tensorflow_serving_latency_999%7Bnamespace%3D%22qke-generic-jarvis-cupid-algo%22%2Cpod%3D~%22jarvis-ads-algo-cpx-e2-episode-pcvr-26035-qpaas-hslf-6d56d225cd.......

That URI lists every pod of the workload, which is what makes it too long. But the query configured in the EHPA is avg(tensorflow_serving_latency_999{namespace="namespace",pod~="abcd."}), so how does the failing URI request come about?
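For context, here is a minimal sketch of an EHPA carrying this annotation, reconstructed from the snippets above. Everything except the annotation key, the query, and the namespace is a placeholder rather than the exact manifest; the metric is shown as a Pods-type custom metric because the failing custom.metrics URI above goes through the .../pods/*/... path:

apiVersion: autoscaling.crane.io/v1alpha1
kind: EffectiveHorizontalPodAutoscaler
metadata:
  name: example-ehpa                    # placeholder
  namespace: qke-generic-jarvis-cupid-algo
  annotations:
    # custom prometheus query for the metric, as quoted above
    metric-query.autoscaling.crane.io/services.tensorflow_serving_latency_999: 'avg(tensorflow_serving_latency_999{namespace="namespace",pod~="abcd."})'
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-workload              # placeholder
  minReplicas: 10                       # placeholder
  maxReplicas: 200                      # placeholder
  metrics:
  - type: Pods                          # a Pods-type custom metric
    pods:
      metric:
        name: tensorflow_serving_latency_999
      target:
        type: AverageValue
        averageValue: "500"             # placeholder target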
If the metric type is Pods, the query on the prometheus-adapter side automatically carries the pod label selector. You can change prometheus-adapter's query method to POST; this can be changed in its startup arguments.
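Concretely, that is why the long URI shows up even though the configured query contains no pod list: for a Pods-type metric the adapter restricts the query to the pods currently selected by the target, so the expression it actually sends looks roughly like the following (an illustrative reconstruction with placeholder pod names, not the exact query from the logs):

avg(tensorflow_serving_latency_999{namespace="qke-generic-jarvis-cupid-algo",pod=~"<pod-1>|<pod-2>|...|<pod-150>"})

With around 150 pod names in the regex and the query sent as a GET request, the whole expression is URL-encoded into the query string, the URL exceeds the server-side length limit, and the request fails with HTTP 414 (URI Too Long). The Prometheus HTTP API also accepts /api/v1/query via POST with the expression in the request body, so switching the query method to POST sidesteps the URL length limit.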