
With humanize-memory, some MEM recommendations use the "millibyte" unit

[Open] kyleli666 opened this issue 10 months ago • 8 comments

Which component are you using?: /area vertical-pod-autoscaler

What version of the component are you using?:

Component version: Image tag 1.3.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.29.13
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.8-eks-2d5f260

What environment is this in?: EKS

What did you expect to happen?: Memory recommendation in unit "Mi"

What happened instead?: Memory recommendation in unit "m"

How to reproduce it (as minimally and precisely as possible): My VPA is installed with:

tag=1.3.0; helm upgrade -i -n kube-system vpa fairwinds-stable/vpa -f vpa-values.yaml --set recommender.image.tag=$tag,updater.image.tag=$tag,admissionController.image.tag=$tag
# vpa-values.yaml
recommender:
  # recommender.enabled -- If true, the vpa recommender component will be installed.
  enabled: true
  # recommender.extraArgs -- A set of key-value flags to be passed to the recommender
  extraArgs:
    v: "4"
    humanize-memory: true # starting from 1.3.0
    pod-recommendation-min-cpu-millicores: 10
    pod-recommendation-min-memory-mb: 50
    target-cpu-percentile: 0.50
    target-memory-percentile: 0.50
  replicaCount: 1
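
To confirm the flags actually reached the recommender after the upgrade, I inspect the container args (the deployment name here assumes the chart's defaults; adjust it to your release):

$ kubectl -n kube-system get deployment vpa-recommender \
    -o jsonpath='{.spec.template.spec.containers[0].args}'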

Without "humanize-memory: true", I get this result:

$ k get vpa -A -w
NAMESPACE     NAME                       MODE      CPU    MEM         PROVIDED   AGE
kube-system   vpa-admission-controller   Auto      11m    52428800    True       158d
kube-system   vpa-recommender            Auto      11m    63544758    True       158d
kube-system   vpa-updater                Auto      11m    78221997    True       158d
logging       fluent-bit-service         Initial   126m   225384266   True       3d2h
monitoring    grafana                    Auto      11m    93633096    True       146d

With "humanize-memory: true", I get a result where 50Mi looks correct but the others do not:

$ k get vpa -A -w
NAMESPACE     NAME                       MODE      CPU    MEM             PROVIDED   AGE
kube-system   vpa-admission-controller   Auto      11m    50Mi            True       158d
kube-system   vpa-recommender            Auto      11m    63543705600m    True       158d
kube-system   vpa-updater                Auto      11m    93637836800m    True       158d
logging       fluent-bit-service         Initial   126m   272063528960m   True       3d2h
monitoring    grafana                    Auto      11m    93637836800m    True       146d
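
For what it's worth, decoding one of those values with apimachinery (a quick sketch, not VPA code) confirms the trailing "m" is the standard milli suffix, i.e. thousandths of a byte:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    q := resource.MustParse("63543705600m")
    // 63543705600 milli-bytes = 63,543,705.6 bytes; Value() rounds up
    // to whole bytes, which lands close to the 63544758 bytes reported
    // without the flag.
    fmt.Println(q.Value()) // 63543706
}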

My VPA objects for the VPA components:

---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-admission-controller
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-admission-controller
  updatePolicy:
    updateMode: Auto
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-recommender
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          memory: 50Mi
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-updater
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-updater
  updatePolicy:
    updateMode: Auto

Anything else we need to know?: My recommender logs are attached: recommender.log

kyleli666 • Jan 26 '25

cc @omerap12

adrianmoisey • Jan 26 '25

I was debugging this. I've made a failing test to assist with fixing it: https://github.com/kubernetes/autoscaler/pull/7771

At the moment my best guess is that this line is converting the value back to a non-humanised form: https://github.com/kubernetes/autoscaler/blob/3291baee042126cbf64334649d7b3e3c7efbe478/vertical-pod-autoscaler/pkg/recommender/model/types.go#L100
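
The round-trip is easy to reproduce outside of VPA with apimachinery alone. A minimal sketch of the suspected mechanism (not the confirmed root cause): Quantity's canonical form never emits fractional digits, so a humanized value like 60.6Mi, which denotes a fractional number of bytes, can only be serialized at the milli scale.

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // An integer number of bytes keeps a clean binary suffix.
    fmt.Println(resource.MustParse("50Mi").String()) // 50Mi

    // 60.6Mi is 60.6 * 2^20 = 63543705.6 bytes. Canonical form never
    // emits fractional digits, so the quantity is scaled to milli and
    // serialized as 63543705600m, exactly the string in the report.
    fmt.Println(resource.MustParse("60.6Mi").String()) // 63543705600m
}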

adrianmoisey • Jan 26 '25

Thanks for sharing this. I'll take a look. /assign

omerap12 • Jan 26 '25

I was using prometheus-adapter to get CPU metrics and found that the adapter was reporting memory in bytes with a large number of decimal places, which resulted in:

  • https://github.com/kubernetes-sigs/headlamp/issues/3067

Surprisingly, that apparently has nothing to do with this issue: I've since fixed the prometheus-adapter metrics to report whole-number bytes only (rounded to the nearest million bytes). For a while I suspected that VPA sometimes reporting millibytes was related to that issue, but I no longer think it is.

Karpenter may also be interpreting a request in millibytes as a request in bytes. I don't know if that's really what's happening (I can't see inside the box), but I just nailed down all of these issues in my cluster, and suddenly we're requesting tens of GBs less memory according to EKS Auto Mode.

I'm just going to disable humanize-memory for a little while and hope this gets worked out. I'm following #7855 now, and it looks like it's a bit more complicated than I assumed.

kingdonb • Apr 02 '25

@kingdonb, I must admit I don't fully understand your issue, but yes, resource.Quantity is behaving differently than I expected. I'm still waiting for a response from sig-api-machinery, and I agree that this feature isn't very useful at the moment.

omerap12 • Apr 03 '25

@omerap12 did you get any response from sig-api-machinery?

marcofranssen • May 20 '25

> @omerap12 did you get any response from sig-api-machinery?

Unfortunately, no.

omerap12 • May 20 '25

The issue is worse than I thought: it's not only a problem in VPA. Why does Kubernetes have "millibytes" at all? It seems like a nonsense unit, but it's easy to get it into your API machinery: all you need to do is set a LimitRange with a default limit of 1.1Gi, and your nodes' aggregate limit will be calculated in millibytes forever henceforth.

I spent some time hunting down server-side values because I thought my old VPA experiments with humanize-memory were responsible for them, but in the end I tore down all of those resources, still had this one "error" in my LimitRanges, and the bug was still triggered. Headlamp reacts very poorly to seeing a metric value in millibytes.

ref:

  • https://github.com/kubernetes-sigs/headlamp/issues/3067

...and judging by the fact that I was always getting memory-focused instances from AWS Karpenter (EKS Auto Mode) until I corrected this issue, I think Karpenter reacts poorly too. In Headlamp, at least, I dug into the TypeScript unit-parsing library and found that it assumed memory cannot be measured in millibytes and was falling through to the default case, bytes, which means you get a 1000x multiplier on those values.
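
The LimitRange path is reproducible with apimachinery alone. A minimal sketch: 1.1Gi is 1.1 * 2^30 = 1181116006.4 bytes, which a Quantity can only serialize at the milli scale:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Fractional bytes cannot be emitted in canonical form, so a
    // default limit of 1.1Gi becomes 1181116006400m, and every
    // aggregate computed from it stays in millibytes.
    fmt.Println(resource.MustParse("1.1Gi").String()) // 1181116006400m
}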

kingdonb • Jun 19 '25

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot • Sep 17 '25