With humanize-memory, some MEM recommendations use the "millibyte" unit
Which component are you using?: /area vertical-pod-autoscaler
What version of the component are you using?:
Component version: Image tag 1.3.0
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.29.13
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.8-eks-2d5f260
What environment is this in?: EKS
What did you expect to happen?: Memory recommendation in unit "Mi"
What happened instead?: Memory recommendation in unit "m"
How to reproduce it (as minimally and precisely as possible): My VPA is installed with:
tag=1.3.0; helm upgrade -i -n kube-system vpa fairwinds-stable/vpa -f vpa-values.yaml --set recommender.image.tag=$tag,updater.image.tag=$tag,admissionController.image.tag=$tag
# vpa-values.yaml
recommender:
  # recommender.enabled -- If true, the vpa recommender component will be installed.
  enabled: true
  # recommender.extraArgs -- A set of key-value flags to be passed to the recommender
  extraArgs:
    v: "4"
    humanize-memory: true # starting from 1.3.0
    pod-recommendation-min-cpu-millicores: 10
    pod-recommendation-min-memory-mb: 50
    target-cpu-percentile: 0.50
    target-memory-percentile: 0.50
  replicaCount: 1
Without "humanize-memory: true", I get this result
$ k get vpa -A -w
NAMESPACE NAME MODE CPU MEM PROVIDED AGE
kube-system vpa-admission-controller Auto 11m 52428800 True 158d
kube-system vpa-recommender Auto 11m 63544758 True 158d
kube-system vpa-updater Auto 11m 78221997 True 158d
logging fluent-bit-service Initial 126m 225384266 True 3d2h
monitoring grafana Auto 11m 93633096 True 146d
With "humanize-memory: true", I get a result where 50Mi looks good but the others not.
$ k get vpa -A -w
NAMESPACE NAME MODE CPU MEM PROVIDED AGE
kube-system vpa-admission-controller Auto 11m 50Mi True 158d
kube-system vpa-recommender Auto 11m 63543705600m True 158d
kube-system vpa-updater Auto 11m 93637836800m True 158d
logging fluent-bit-service Initial 126m 272063528960m True 3d2h
monitoring grafana Auto 11m 93637836800m True 146d
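As a side note, the odd numbers line up exactly with fractional Mi values expressed in millibytes: 63543705600m is 63,543,705.6 bytes, which is exactly 60.6Mi. The sketch below only shows that parsing such a fractional Mi value back into a k8s.io/apimachinery resource.Quantity reproduces the displayed string; whether the recommender actually rounds to one decimal place of Mi and round-trips through a Quantity is just a guess at the mechanism, not confirmed from the code.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// 63544758 bytes is roughly 60.6Mi. Parsing that fractional Mi value
	// back into a resource.Quantity yields a number that is not a whole
	// number of bytes, so the canonical serialization falls back to the
	// "m" (milli) suffix.
	q := resource.MustParse("60.6Mi")
	fmt.Println(q.String()) // 63543705600m -- the value kubectl shows
}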
My VPA objects for the VPA components:
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-admission-controller
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-admission-controller
  updatePolicy:
    updateMode: Auto
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-recommender
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          memory: 50Mi
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-updater
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-updater
  updatePolicy:
    updateMode: Auto
Anything else we need to know?: My recommender logs are attached: recommender.log
cc @omerap12
I was debugging this. I've made a failing test to assist with fixing it: https://github.com/kubernetes/autoscaler/pull/7771 At the moment my best guess is that this line is returning it to a non-humanised version: https://github.com/kubernetes/autoscaler/blob/3291baee042126cbf64334649d7b3e3c7efbe478/vertical-pod-autoscaler/pkg/recommender/model/types.go#L100
Thanks for sharing this. I'll take a look. /assign
I was using prometheus-adapter to get CPU metrics and found that the adapter was reporting memory in bytes with a large number of decimal places, which resulted in:
- https://github.com/kubernetes-sigs/headlamp/issues/3067
Surprisingly, that apparently has nothing to do with this issue. I've since fixed the prometheus-adapter metrics to report whole-number bytes only (rounded to a million bytes). For a while I was suspicious that VPA sometimes reporting millibytes had something to do with that issue, but I no longer think it does.
Maybe Karpenter is also interpreting a request in millibytes as a request in bytes. I don't know if that's really what's happening, since I can't see inside the box, but I just nailed down all of these issues in my cluster and suddenly we're requesting tens of GB less memory according to EKS Auto Mode.
I'm just going to disable humanize-memory for a little while and hope this gets worked out. I'm following #7855 now and it looks like it's a bit more complicated than I assumed.
@kingdonb, I must admit I don't fully understand your issue, but yeah, the resource.Quantity is behaving differently than I expected. I'm still waiting for a response from sig-api-machinery, and I agree that this feature isn't very useful at the moment.
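For context, the surprise here is mostly resource.Quantity's canonical form: before serializing, the suffix is adjusted so that no precision is lost and no fractional digits are emitted, and anything that isn't a whole number in its base unit falls back to the milli suffix. A minimal sketch of the behaviour documented in the Quantity godoc:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Quantities are always re-emitted in canonical form.
	fmt.Println(resource.MustParse("1.5Gi").String()) // 1536Mi     -- exact in Mi, stays binary
	fmt.Println(resource.MustParse("1.5").String())   // 1500m      -- fractional base unit => milli
	fmt.Println(resource.MustParse("0.9Mi").String()) // 943718400m -- 0.9*2^20 bytes is not an integer
}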
@omerap12 did you get any response from sig-api-machinery?
The issue is worse than I thought; it's not only a problem in VPA. Why does Kubernetes have "millibytes" at all? It seems like a nonsense unit, but it's easy to get it into your API machinery: all you need to do is set a LimitRange with a default limit of 1.1Gi, and your node's aggregate limit will be calculated in millibytes forever henceforth.
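Assuming the standard resource.Quantity canonicalization is what's at work here (an assumption, not something verified in the apiserver code paths), the LimitRange case falls out of simple arithmetic: 1.1Gi is not a whole number of bytes, so any component that parses and re-serializes it will emit millibytes. A small illustration:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// 1.1Gi = 1.1 * 2^30 = 1181116006.4 bytes. Since that is not a whole
	// number of bytes, the canonical form drops the binary suffix and
	// uses millibytes instead.
	limit := resource.MustParse("1.1Gi")
	fmt.Println(limit.String()) // 1181116006400m
}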
I spent some time "hunting down" server-side values because I thought the old VPA experiments with humanize-memory were responsible for them, but in the end I tore down all the resources, still had this one "error" in my LimitRanges, and the bug was still triggered. Headlamp reacts very poorly to seeing a metric value in millibytes.
ref:
- https://github.com/kubernetes-sigs/headlamp/issues/3067
...and judging by the fact that I was always getting memory-focused instances from AWS Karpenter (EKS Auto Mode) until I corrected this issue, I think Karpenter does too. In Headlamp, at least, I dug into the unit-parsing library in TypeScript and found that it assumed memory cannot be measured in millibytes, so it was falling through to the default case (bytes), which means you get a 1000x multiplier on those values.
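For illustration only (Headlamp's parser is TypeScript; this is not its code, and the "naive" branch below is a hypothetical stand-in for a parser that doesn't know about millibytes), the gap between honouring the milli suffix and ignoring it is roughly that factor of 1000:

package main

import (
	"fmt"
	"strconv"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	raw := "63543705600m" // a millibyte memory value as reported above

	// Correct: let resource.Quantity interpret the suffix. Value() rounds
	// up to the nearest whole base unit (bytes here).
	q := resource.MustParse(raw)
	fmt.Println(q.Value()) // 63543706 bytes (about 60.6Mi)

	// Naive: strip the suffix and treat the mantissa as bytes, which is
	// roughly what a parser that doesn't handle millibytes ends up doing
	// -- a 1000x overestimate.
	naive, _ := strconv.ParseInt(strings.TrimSuffix(raw, "m"), 10, 64)
	fmt.Println(naive) // 63543705600 "bytes" (about 59Gi if read that way)
}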
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale