
Recommender not using Prometheus history

Open spewu opened this issue 4 years ago • 19 comments

Which component are you using?: recommender - installed via https://charts.fairwinds.com/stable

What version of the component are you using?: k8s.gcr.io/autoscaling/vpa-recommender:0.9.2, installed via the fairwinds-stable/vpa Helm chart v0.4.4

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:15:04Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"0b17c6315e806a66d507e77760a5d60ab5cccfd8", GitTreeState:"clean", BuildDate:"2021-08-30T01:42:22Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Azure Kubernetes Service (AKS)

What did you expect to happen?:

I expected to see sensible recommendations inside Goldilocks, based on the historical usage data in Prometheus.

What happened instead?:

The current CPU recommendations are very low; applying them would break our API. So my current theory is that the VPA recommender is not reading the historical data from Prometheus properly.
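
A quick way to sanity-check this theory (a sketch; the exact pod name depends on the chart and release, so look it up first) is to confirm the recommender actually started its Prometheus history provider by checking its logs:

# find the recommender pod name (it varies with the chart/release)
kubectl -n vpa get pods
# look for Prometheus history provider messages (or errors) at startup
kubectl -n vpa logs <recommender-pod-name> | grep -i prometheus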

How to reproduce it (as minimally and precisely as possible):

  1. Create a new AKS cluster
  2. Install Prometheus via https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack with this command:
helm upgrade --install --version 19.0.1 --namespace prometheus --create-namespace prometheus prometheus-community/kube-prometheus-stack --values values.yaml

And this values file:

# Disabling scraping of Master Nodes Components
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
kubelet:
  serviceMonitor:
    # Disables the normal cAdvisor scraping, as we re-add it with the job name "kubernetes-cadvisor" under additionalScrapeConfigs
    # The reason for doing this is to enable the VPA to use the metrics for the recommender
    # https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/FAQ.md#how-can-i-use-prometheus-as-a-history-provider-for-the-vpa-recommender
    cAdvisor: false
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          # the azurefile storage class is created automatically on AKS
          storageClassName: azurefile
          accessModes: ["ReadWriteMany"]
          resources:
            requests:
              storage: 100Gi
    additionalScrapeConfigs:
      - job_name: 'kubernetes-cadvisor'
        scheme: https
        metrics_path: /metrics/cadvisor
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
  3. Install the VPA recommender via https://github.com/FairwindsOps/charts/tree/master/stable/vpa with this command:
helm upgrade --install --version 0.4.4 --namespace vpa --create-namespace vpa fairwinds-stable/vpa --values values.yaml

With this values file:

updater:
  enabled: false
recommender:
  extraArgs:
    storage: prometheus
    prometheus-address: http://prometheus-kube-prometheus-prometheus.prometheus.svc.cluster.local:9090
  4. Install Goldilocks to get an easy web UI for viewing the recommendations:
helm upgrade --install --version 3.2.8 --namespace goldilocks goldilocks fairwinds-stable/goldilocks --create-namespace
  5. Deploy something to monitor recommendations for, and let it run for a while (ideally with some varying load, so the recommender has a good history to base its recommendations on). A sketch for verifying the scrape setup follows these steps.
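
Before blaming the recommender, it is worth verifying that the custom kubernetes-cadvisor job from step 2 is actually scraping. A sketch (service and job names are taken from the values files above; the metric name is standard cAdvisor):

# expose the Prometheus API locally
kubectl -n prometheus port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &
# a non-empty count means the job the recommender queries by default has data
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count(container_cpu_usage_seconds_total{job="kubernetes-cadvisor"})'

The recommender's Prometheus history provider filters container metrics by job name (kubernetes-cadvisor by default), so an empty result here would explain recommendations that ignore history.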

Anything else we need to know?:

My original kube-prometheus-stack values file looked like this:

# Disabling scraping of Master Nodes Components
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          # the azurefile storage class is created automatically on AKS
          storageClassName: azurefile
          accessModes: ["ReadWriteMany"]
          resources:
            requests:
              storage: 100Gi

But I followed the advice in this answer on Stack Overflow to get the job label to say kubernetes-cadvisor, and ended up with the values file listed further up. However, this broke the Grafana dashboards, and it does not seem to have had any impact on the recommender either 🙁
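
A possible alternative, which I have not verified, would be to leave the default kubelet cAdvisor scraping intact (so the Grafana dashboards keep working) and instead point the recommender at that job via its prometheus-cadvisor-job-name flag, since kube-prometheus-stack normally labels the cAdvisor metrics with job="kubelet". Sketched against the fairwinds chart values, assuming extraArgs are passed through to the recommender as flags:

updater:
  enabled: false
recommender:
  extraArgs:
    storage: prometheus
    prometheus-address: http://prometheus-kube-prometheus-prometheus.prometheus.svc.cluster.local:9090
    # override the recommender's default job name (kubernetes-cadvisor)
    # to match kube-prometheus-stack's built-in kubelet/cAdvisor job
    prometheus-cadvisor-job-name: kubelet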

spewu avatar Oct 05 '21 14:10 spewu

Since you're using Goldilocks I'd ask them about this. I think they're not using plan VPA

jbartosik avatar Nov 26 '21 14:11 jbartosik

@jbartosik Goldilocks does use VPA, but can you clarify what "plan VPA" means in this context?

sudermanjr avatar Nov 30 '21 14:11 sudermanjr

I made a typo. I meant "plain VPA". I think Goldilocks makes some changes to VPA. So it will be hard for me to reproduce the problem.

jbartosik avatar Dec 02 '21 08:12 jbartosik

Goldilocks uses upstream VPA with zero modification.

sudermanjr avatar Dec 02 '21 18:12 sudermanjr

We are also running into this problem.

jonaz avatar Jan 14 '22 13:01 jonaz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 14 '22 14:04 k8s-triage-robot

/remove-lifecycle stale

jonaz avatar Apr 15 '22 04:04 jonaz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 14 '22 04:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 13 '22 05:08 k8s-triage-robot

Can you try setting the metric-for-pod-labels configuration value to kube_pod_labels{job="kube-state-metrics"}[8d]?
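
For reference, a sketch of what the full set of history-related extraArgs might then look like for the fairwinds chart. The flag names come from the recommender; the label_ prefix and the pod/namespace label names assume kube-state-metrics conventions for kube_pod_labels:

recommender:
  extraArgs:
    storage: prometheus
    prometheus-address: http://prometheus-kube-prometheus-prometheus.prometheus.svc.cluster.local:9090
    # read pod labels from kube-state-metrics instead of the default up{job="kubernetes-pods"}
    metric-for-pod-labels: 'kube_pod_labels{job="kube-state-metrics"}[8d]'
    # kube-state-metrics prefixes pod labels with label_ and exposes
    # the pod name and namespace as pod and namespace
    pod-label-prefix: label_
    pod-name-label: pod
    pod-namespace-label: namespace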

aslafy-z avatar Aug 26 '22 09:08 aslafy-z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Sep 25 '22 09:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 25 '22 09:09 k8s-ci-robot

Could someone reopen this issue, please?

aslafy-z avatar Sep 25 '22 15:09 aslafy-z

/reopen

jonaz avatar Sep 25 '22 21:09 jonaz

@jonaz: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 25 '22 21:09 k8s-ci-robot

😕

jonaz avatar Sep 25 '22 21:09 jonaz

@spewu can reopen it

jonaz avatar Sep 25 '22 21:09 jonaz

/reopen

spewu avatar Sep 26 '22 06:09 spewu

@spewu: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 26 '22 06:09 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 26 '22 06:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 26 '22 06:10 k8s-ci-robot

@spewu can you reopen the issue, please?

aslafy-z avatar Oct 26 '22 08:10 aslafy-z

/remove-lifecycle rotten

aslafy-z avatar Oct 26 '22 08:10 aslafy-z