CurrentAverageValue isn't an integer & CurrentValue is 0
The issue we are experiencing is that the CloudWatch adapter appears to be able to read from CloudWatch (there are no auth errors anywhere), but we are getting a currentValue of 0 and a currentAverageValue that is not a plain integer and looks far too large, e.g. 18856m.
It is using IAM roles for service accounts (IRSA) on EKS.
HPA live annotations:
autoscaling.alpha.kubernetes.io/current-metrics: >-
  [{"type":"External","external":{"metricName":"REPLACE-queue-length","currentValue":"0","currentAverageValue":"18556m"}}]
autoscaling.alpha.kubernetes.io/metrics: >-
  [{"type":"External","external":{"metricName":"REPLACE-queue-length","targetAverageValue":"40"}}]
Here are my YAML definitions:
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: REPLACE
  labels:
    version: REPLACE
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: REPLACE
  minReplicas: 2
  maxReplicas: 1024
  metrics:
    - type: External
      external:
        metricName: REPLACE-queue-length
        targetAverageValue: 40
---
apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: REPLACE-queue-length
spec:
  name: REPLACE-queue-length
  resource:
    resource: "deployment"
  queries:
    - id: sqs_REPLACE_files
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "ApproximateNumberOfMessagesVisible"
          dimensions:
            - name: QueueName
              value: REPLACE
        period: 60
        stat: Average
        unit: Count
      returnData: true
Here is the CloudWatch adapter manifest:
---
apiVersion: v1
kind: Namespace
metadata:
  name: custom-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-cloudwatch-adapter:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
  - kind: ServiceAccount
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: k8s-cloudwatch-adapter-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
  - kind: ServiceAccount
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: k8s-cloudwatch-adapter
  name: k8s-cloudwatch-adapter
  namespace: custom-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-cloudwatch-adapter
  template:
    metadata:
      labels:
        app: k8s-cloudwatch-adapter
      name: k8s-cloudwatch-adapter
    spec:
      securityContext:
        fsGroup: 65534
      serviceAccountName: k8s-cloudwatch-adapter
      containers:
        - name: k8s-cloudwatch-adapter
          env:
            - name: AWS_DEFAULT_REGION
              value: REPLACE
          image: chankh/k8s-cloudwatch-adapter:v0.8.0
          imagePullPolicy: "Always"
          args:
            - /adapter
            - --cert-dir=/tmp
            - --secure-port=6443
            - --logtostderr=true
            - --v=10
          ports:
            - containerPort: 6443
              name: https
            - containerPort: 8080
              name: http
          volumeMounts:
            - mountPath: /tmp
              name: temp-vol
      volumes:
        - name: temp-vol
          emptyDir: {}
        - name: token-vol
          projected:
            sources:
              - serviceAccountToken:
                  path: token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-cloudwatch-adapter-resource-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: k8s-cloudwatch-adapter-resource-reader
subjects:
  - kind: ServiceAccount
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: k8s-cloudwatch-adapter
  namespace: custom-metrics
---
apiVersion: v1
kind: Service
metadata:
  name: k8s-cloudwatch-adapter
  namespace: custom-metrics
spec:
  ports:
    - name: https
      port: 443
      targetPort: 6443
    - name: http
      port: 80
      targetPort: 8080
  selector:
    app: k8s-cloudwatch-adapter
---
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  service:
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics
  group: external.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-cloudwatch-adapter:external-metrics-reader
rules:
  - apiGroups:
      - external.metrics.k8s.io
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-cloudwatch-adapter-resource-reader
rules:
  - apiGroups:
      - ""
    resources:
      - namespaces
      - pods
      - services
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-cloudwatch-adapter:external-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: k8s-cloudwatch-adapter:external-metrics-reader
subjects:
  - kind: ServiceAccount
    name: horizontal-pod-autoscaler
    namespace: kube-system
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: externalmetrics.metrics.aws
spec:
  group: metrics.aws
  version: v1alpha1
  names:
    kind: ExternalMetric
    plural: externalmetrics
    singular: externalmetric
  scope: Namespaced
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-cloudwatch-adapter:crd-metrics-reader
  labels:
    app: k8s-cloudwatch-adapter
rules:
  - apiGroups:
      - metrics.aws
    resources:
      - "externalmetrics"
    verbs:
      - list
      - get
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: k8s-cloudwatch-adapter:crd-metrics-reader
  labels:
    app: k8s-cloudwatch-adapter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: k8s-cloudwatch-adapter:crd-metrics-reader
subjects:
  - name: k8s-cloudwatch-adapter
    namespace: "custom-metrics"
    kind: ServiceAccount
Service account definition with eksctl:
- metadata:
    name: k8s-cloudwatch-adapter
    namespace: custom-metrics
    labels: {aws-usage: "cluster-ops"}
  attachPolicy:
    Version: "2012-10-17"
    Statement:
      - Effect: Allow
        Action:
          - "cloudwatch:GetMetricData"
          - "cloudwatch:GetMetricStatistics"
          - "cloudwatch:ListMetrics"
        Resource: '*'
I would like to add that the HPA does schedule more replicas and scales up, but then it stops scaling and actually scales down, even while pods are still processing queue items and those items are in flight. This ends up causing the cluster autoscaler to scale in, ungracefully terminating the pods and leaving items in flight.
You can refer to the HPA docs for details about how scaling works. It looks like you have a lot of messages in your SQS queue, and that's why the HPA is scheduling more replicas.
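In short, the HPA computes `desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]`, and with targetAverageValue the comparison is made against the per-pod average. Also note that Kubernetes expresses fractional quantities with the milli ("m") suffix, so a currentAverageValue of 18556m means roughly 18.556 messages per pod, not 18556; once that per-pod average falls below your target of 40, the HPA will start scaling back down.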
Your application needs to be able to handle the SIGTERM signal, and you can use a preStop hook to perform actions before your application pod is terminated, i.e. stop consuming messages, handle in-flight messages, etc. For more information, please check out container lifecycle hooks.
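As a minimal sketch (the container name, grace period, and drain script below are placeholders you would adapt to your own worker), the scale target Deployment could look something like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: REPLACE
spec:
  replicas: 2
  selector:
    matchLabels:
      app: REPLACE
  template:
    metadata:
      labels:
        app: REPLACE
    spec:
      # Allow enough time for in-flight messages to finish before the pod is force-killed.
      terminationGracePeriodSeconds: 120
      containers:
        - name: worker
          image: REPLACE
          lifecycle:
            preStop:
              exec:
                # Hypothetical drain script: stop polling SQS and wait for
                # in-flight messages to complete before exiting.
                command: ["/bin/sh", "-c", "/app/drain.sh"]

Kubernetes runs the preStop hook first and only then sends SIGTERM, and it force-kills the container once terminationGracePeriodSeconds elapses, so the worker should stop pulling new messages and finish (or release) its in-flight ones within that window.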