
Recommender not using Prometheus history

Open spewu opened this issue 4 years ago • 19 comments

Which component are you using?: recommender - installed via https://charts.fairwinds.com/stable

What version of the component are you using?: k8s.gcr.io/autoscaling/vpa-recommender:0.9.2, installed via the fairwinds-stable/vpa Helm chart v0.4.4

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:15:04Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"0b17c6315e806a66d507e77760a5d60ab5cccfd8", GitTreeState:"clean", BuildDate:"2021-08-30T01:42:22Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Azure Kubernetes Service (AKS)

What did you expect to happen?:

I expected to see sensible recommendations inside Goldilocks, based on the historical usage data in Prometheus.

What happened instead?:

The current CPU recommendations are very low; applying them would break our API. So my current theory is that the VPA recommender is not reading the historical data from Prometheus properly.
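
A quick way to sanity-check this theory (a sketch; the exact pod name depends on the chart and release, so look it up first) is to confirm the recommender actually started its Prometheus history provider by checking its logs:

# find the recommender pod name (it varies with the chart/release)
kubectl -n vpa get pods
# look for Prometheus history provider messages (or errors) at startup
kubectl -n vpa logs <recommender-pod-name> | grep -i prometheus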

How to reproduce it (as minimally and precisely as possible):

  1. Create a new AKS cluster
  2. Install Prometheus via https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack with this command:
helm upgrade --install --version 19.0.1 --namespace prometheus --create-namespace prometheus prometheus-community/kube-prometheus-stack --values values.yaml

And this values file:

# Disabling scraping of Master Nodes Components
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
kubelet:
  serviceMonitor:
    # Disables the normal cAdvisor scraping, as we re-add it with the job name "kubernetes-cadvisor" under additionalScrapeConfigs
    # The reason for doing this is to enable the VPA to use the metrics for the recommender
    # https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/FAQ.md#how-can-i-use-prometheus-as-a-history-provider-for-the-vpa-recommender
    cAdvisor: false
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          # the azurefile storage class is created automatically on AKS
          storageClassName: azurefile
          accessModes: ["ReadWriteMany"]
          resources:
            requests:
              storage: 100Gi
    additionalScrapeConfigs:
      - job_name: 'kubernetes-cadvisor'
        scheme: https
        metrics_path: /metrics/cadvisor
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
  3. Install the VPA recommender via https://github.com/FairwindsOps/charts/tree/master/stable/vpa with this command:
helm upgrade --install --version 0.4.4 --namespace vpa --create-namespace vpa fairwinds-stable/vpa --values values.yaml

With this values file:

updater:
  enabled: false
recommender:
  extraArgs:
    storage: prometheus
    prometheus-address: http://prometheus-kube-prometheus-prometheus.prometheus.svc.cluster.local:9090
  4. Install Goldilocks to get an easy web UI for viewing the recommendations:
helm upgrade --install --version 3.2.8 --namespace goldilocks goldilocks fairwinds-stable/goldilocks --create-namespace
  5. Deploy something to monitor recommendations for, and let it run for a while (ideally with some varying load, so the recommender has a good history to base its recommendations on). A sketch for verifying the scrape setup follows these steps.
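
Before blaming the recommender, it is worth verifying that the custom kubernetes-cadvisor job from step 2 is actually scraping. A sketch (service and job names are taken from the values files above; the metric name is standard cAdvisor):

# expose the Prometheus API locally
kubectl -n prometheus port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &
# a non-empty count means the job the recommender queries by default has data
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count(container_cpu_usage_seconds_total{job="kubernetes-cadvisor"})'

The recommender's Prometheus history provider filters container metrics by job name (kubernetes-cadvisor by default), so an empty result here would explain recommendations that ignore history.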

Anything else we need to know?:

My original kube-prometheus-stack values file looked like this:

# Disabling scraping of Master Nodes Components
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          # the azurefile storage class is created automatically on AKS
          storageClassName: azurefile
          accessModes: ["ReadWriteMany"]
          resources:
            requests:
              storage: 100Gi

But I followed the advice in this answer on Stack Overflow to get the job label to say kubernetes-cadvisor, and ended up with the values file listed further up. However, this broke the Grafana dashboards, and it does not seem to have had any impact on the recommender either 🙁
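
A possible alternative, which I have not verified, would be to leave the default kubelet cAdvisor scraping intact (so the Grafana dashboards keep working) and instead point the recommender at that job via its prometheus-cadvisor-job-name flag, since kube-prometheus-stack normally labels the cAdvisor metrics with job="kubelet". Sketched against the fairwinds chart values, assuming extraArgs are passed through to the recommender as flags:

updater:
  enabled: false
recommender:
  extraArgs:
    storage: prometheus
    prometheus-address: http://prometheus-kube-prometheus-prometheus.prometheus.svc.cluster.local:9090
    # override the recommender's default job name (kubernetes-cadvisor)
    # to match kube-prometheus-stack's built-in kubelet/cAdvisor job
    prometheus-cadvisor-job-name: kubelet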

spewu avatar Oct 05 '21 14:10 spewu

Since you're using Goldilocks I'd ask them about this. I think they're not using plan VPA

jbartosik avatar Nov 26 '21 14:11 jbartosik

@jbartosik Goldilocks does use VPA, but can you clarify what "plan VPA" means in this context?

sudermanjr avatar Nov 30 '21 14:11 sudermanjr

I made a typo. I meant "plain VPA". I think Goldilocks makes some changes to VPA. So it will be hard for me to reproduce the problem.

jbartosik avatar Dec 02 '21 08:12 jbartosik

Goldilocks uses upstream VPA with zero modification.

sudermanjr avatar Dec 02 '21 18:12 sudermanjr

We are also running into this problem.

jonaz avatar Jan 14 '22 13:01 jonaz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 14 '22 14:04 k8s-triage-robot

/remove-lifecycle stale

jonaz avatar Apr 15 '22 04:04 jonaz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 14 '22 04:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 13 '22 05:08 k8s-triage-robot

Can you try setting the metric-for-pod-labels configuration value to kube_pod_labels{job="kube-state-metrics"}[8d]?
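
For reference, a sketch of what the full set of history-related extraArgs might then look like for the fairwinds chart. The flag names come from the recommender; the label_ prefix and the pod/namespace label names assume kube-state-metrics conventions for kube_pod_labels:

recommender:
  extraArgs:
    storage: prometheus
    prometheus-address: http://prometheus-kube-prometheus-prometheus.prometheus.svc.cluster.local:9090
    # read pod labels from kube-state-metrics instead of the default up{job="kubernetes-pods"}
    metric-for-pod-labels: 'kube_pod_labels{job="kube-state-metrics"}[8d]'
    # kube-state-metrics prefixes pod labels with label_ and exposes
    # the pod name and namespace as pod and namespace
    pod-label-prefix: label_
    pod-name-label: pod
    pod-namespace-label: namespace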

aslafy-z avatar Aug 26 '22 09:08 aslafy-z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Sep 25 '22 09:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 25 '22 09:09 k8s-ci-robot

Could someone reopen this issue, please?

aslafy-z avatar Sep 25 '22 15:09 aslafy-z

/reopen

jonaz avatar Sep 25 '22 21:09 jonaz

@jonaz: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 25 '22 21:09 k8s-ci-robot

😕

jonaz avatar Sep 25 '22 21:09 jonaz

@spewu can reopen it

jonaz avatar Sep 25 '22 21:09 jonaz

/reopen

spewu avatar Sep 26 '22 06:09 spewu

@spewu: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 26 '22 06:09 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 26 '22 06:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 26 '22 06:10 k8s-ci-robot

@spewu can you reopen the issue, please?

aslafy-z avatar Oct 26 '22 08:10 aslafy-z

/remove-lifecycle rotten

aslafy-z avatar Oct 26 '22 08:10 aslafy-z