
[Feature] Add support for Horizontal Pod Autoscaling

Open · patsevanton opened this issue on Nov 07 '23 • 21 comments

Hello! Please add support for Horizontal Pod Autoscaling. This is my current output:

│ 1.0 -> ? (No data)     │ 1.0 -> ? (No data)      │             │ 100Mi -> ? (No data)   │ 2000Mi -> ? (No data)   │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 100m -> ? (No data)     │             │ 100Mi -> ? (No data)   │ 100Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 100m -> ? (No data)     │             │ 256Mi -> ? (No data)   │ 256Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 1.0 -> ? (No data)     │ 1.0 -> ? (No data)      │             │ 256Mi -> ? (No data)   │ 256Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 1.0 -> ? (No data)     │ 1.0 -> ? (No data)      │             │ 100Mi -> ? (No data)   │ 2000Mi -> ? (No data)   │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 1.0 -> ? (No data)     │ 1.0 -> ? (No data)      │             │ 256Mi -> ? (No data)   │ 256Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 100m -> ? (No data)     │             │ 100Mi -> ? (No data)   │ 100Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 500m -> ? (No data)     │             │ 100Mi -> ? (No data)   │ 500Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 100m -> ? (No data)     │             │ 100Mi -> ? (No data)   │ 100Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 100m -> ? (No data)     │             │ 100Mi -> ? (No data)   │ 100Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 500m -> ? (No data)     │             │ 100Mi -> ? (No data)   │ 500Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 500m -> ? (No data)    │ 1.0 -> ? (No data)      │             │ 500Mi -> ? (No data)   │ 1500Mi -> ? (No data)   │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 500m -> ? (No data)    │ 1.0 -> ? (No data)      │             │ 1000Mi -> ? (No data)  │ 1500Mi -> ? (No data)   │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 500m -> ? (No data)     │             │ 100Mi -> ? (No data)   │ 500Mi -> ? (No data)    │
┼────────────────────────┼─────────────────────────┼─────────────┼────────────────────────┼─────────────────────────┼
│ 100m -> ? (No data)    │ 500m -> ? (No data)     │             │ 100Mi -> ? (Not enough data) │ 500Mi -> ? (Not enough data) │

patsevanton avatar Nov 07 '23 14:11 patsevanton

Hey all, we're still figuring out the appropriate algorithm to use in this case. (The usual logic doesn't work if you're scaling according to CPU/memory utilization.)

To help, we'd love to hear from each of you:

  1. How is your HPA defined? What metric do you use to scale?
  2. What logic would make the most sense for recommendations when using the HPA?

aantn avatar Nov 16 '23 05:11 aantn

I use the standard CPU request and memory request metrics.

patsevanton avatar Nov 17 '23 03:11 patsevanton

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: application
    meta.helm.sh/release-namespace: application
  creationTimestamp: "2023-07-24T19:28:02Z"
  labels:
    app-name: application
    app-version: release-f469738c
    app.kubernetes.io/managed-by: Helm
  name: application
  namespace: application
  resourceVersion: "249236825"
  uid: 15d78a48-a594-4908-8723-82ba7291f6b0
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 4
      - periodSeconds: 60
        type: Percent
        value: 25
      selectPolicy: Max
    scaleUp:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 4
      - periodSeconds: 60
        type: Percent
        value: 25
      selectPolicy: Max
      stabilizationWindowSeconds: 0
  maxReplicas: 25
  metrics:
  - resource:
      name: memory
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  - resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  minReplicas: 6
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application
status:
  conditions:
  - lastTransitionTime: "2023-07-24T19:28:17Z"
    message: recommended size matches current size
    reason: ReadyForNewScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2023-11-16T19:17:09Z"
    message: the HPA was able to successfully calculate a replica count from memory
      resource utilization (percentage of request)
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2023-11-16T18:59:50Z"
    message: the desired replica count is less than the minimum replica count
    reason: TooFewReplicas
    status: "True"
    type: ScalingLimited
  currentMetrics:
  - resource:
      current:
        averageUtilization: 59
        averageValue: 457606485333m
      name: memory
    type: Resource
  - resource:
      current:
        averageUtilization: 18
        averageValue: 202m
      name: cpu
    type: Resource
  currentReplicas: 6
  desiredReplicas: 6
  lastScaleTime: "2023-11-16T18:04:52Z"

patsevanton avatar Nov 17 '23 03:11 patsevanton
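As a side note, the status block above is consistent with the standard HPA scaling formula, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), evaluated per metric and then clamped to the replica bounds. A minimal Python sketch (an illustration only, not KRR code) using just the numbers from this manifest:

```python
import math

# Values copied from the HPA spec/status above.
current_replicas = 6
min_replicas, max_replicas = 6, 25
target_utilization = 80                          # spec: averageUtilization for both metrics
current_utilization = {"memory": 59, "cpu": 18}  # status: currentMetrics (percent of requests)

def desired_for(metric_utilization: int) -> int:
    # Standard HPA formula: ceil(currentReplicas * currentMetricValue / desiredMetricValue)
    return math.ceil(current_replicas * metric_utilization / target_utilization)

per_metric = {name: desired_for(u) for name, u in current_utilization.items()}
# {'memory': 5, 'cpu': 2}: the HPA takes the max across metrics, then clamps it
desired = min(max(max(per_metric.values()), min_replicas), max_replicas)
print(per_metric, desired)  # desired == 6, i.e. held up by minReplicas
```

With 59% memory and 18% CPU utilization against an 80% target, the raw desired count is 5, which the minReplicas: 6 floor pushes back up to 6; that matches the TooFewReplicas condition and desiredReplicas: 6 in the status.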

Thanks. In a case like this, we can't just use the standard KRR strategy, because it leads to unstable recommendations. E.g. if the HPA sets target.averageUtilization: 80 for CPU, but KRR is outputting recommendations for CPU at the 99th percentile, then KRR's recommendations can potentially cause the HPA to always scale, and the HPA's behaviour can in turn change KRR's recommendations.

We likely need a strategy that takes both into account from the start.

aantn avatar Nov 19 '23 10:11 aantn
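To make that instability concrete, here is a toy simulation under made-up assumptions (constant total load, per-pod p99 usage about 20% above the average, an 80% utilization target); none of the numbers come from a real cluster. Because sizing the request off a usage percentile pins the observed utilization at a fixed value above the target, the HPA keeps adding replicas even though the load never changes:

```python
import math

# Toy model of the feedback loop, not KRR code.
TOTAL_CPU = 3.0      # total cores consumed by the workload (assumed constant)
P99_OVER_AVG = 1.2   # ratio of p99 usage to average usage (assumed spread)
TARGET = 0.80        # HPA averageUtilization target
replicas, max_replicas = 6, 25

for step in range(6):
    avg_usage = TOTAL_CPU / replicas
    request = P99_OVER_AVG * avg_usage   # percentile-style recommendation applied as the new request
    utilization = avg_usage / request    # what the HPA observes (always 1 / P99_OVER_AVG here)
    desired = math.ceil(replicas * utilization / TARGET)
    print(f"step {step}: replicas={replicas:2d}  request={request:.3f}  "
          f"utilization={utilization:.0%}  desired={desired}")
    replicas = min(desired, max_replicas)
```

With a wider usage spread the ratio flips, utilization sits below the target, and the HPA drifts toward minReplicas instead; either way the two mechanisms pull on each other, which is why the recommendation strategy and the HPA target have to be designed together.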

Hi, instead of providing recommendations for a Deployment where HPA is enabled, could it provide recommendations per pod? Almost every pod under the HPA shows the same memory and CPU utilization.

sksaranraj avatar Feb 23 '24 05:02 sksaranraj

Hello @aantn, is there any plan to add support for HPA resources?

dgdevops avatar Mar 05 '24 12:03 dgdevops

Yep, the tricky part is - as always - what the algorithm should be.

We have some ideas for the algorithm, but for starters we're going to add an --allow_hpa flag so that you can tell the current algorithm "give me a recommendation even though the pod uses the HPA; that's OK with me".

And if you have input on what an HPA-optimized algorithm should look like, I'd love to chat!

aantn avatar Mar 05 '24 12:03 aantn

Thank you for your prompt response; having the flag is a great start in my opinion.

dgdevops avatar Mar 05 '24 12:03 dgdevops

Of course. We'll have a PR fairly soon.

aantn avatar Mar 05 '24 12:03 aantn

Great stuff @aantn, what are the plans & timeline for releasing it?

dgdevops avatar Mar 06 '24 11:03 dgdevops

Can you try --allow_hpa on #226?

aantn avatar Mar 07 '24 17:03 aantn

Hello @aantn, I have tested the --allow_hpa flag and now I can see CPU & Memory recommendations for the workloads that we have HPA configured for.

dgdevops avatar Mar 08 '24 08:03 dgdevops

Excellent. Happy to hear it!

aantn avatar Mar 08 '24 09:03 aantn

Thank you @aantn for the quick implementation

dgdevops avatar Mar 08 '24 09:03 dgdevops

Which version does the --allow_hpa flag work with?

patsevanton avatar Mar 09 '24 04:03 patsevanton

Only on the branch in #226. Until we merge and do a release, you'll have to check it out locally and run from source, according to the instructions in the README.

aantn avatar Mar 09 '24 04:03 aantn

> Only on the branch in #226. Until we merge and do a release, you'll have to check it out locally and run from source, according to the instructions in the README.

How can I download a binary from this branch without downloading the entire project?

patsevanton avatar Mar 16 '24 03:03 patsevanton

It's not possible at the moment. You need to check out the whole project and follow the from-source instructions here: https://github.com/robusta-dev/krr?tab=readme-ov-file#installation-methods

aantn avatar Mar 16 '24 05:03 aantn
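For anyone who does want to try the branch before a release, a rough sketch of the steps (GitHub exposes PR #226 under the pull/226/head ref; the local branch name and the exact dependency/run commands below are assumptions, so defer to the README's run-from-source section):

```
git clone https://github.com/robusta-dev/krr.git
cd krr
git fetch origin pull/226/head:allow-hpa-test   # any local branch name works
git checkout allow-hpa-test
pip install -r requirements.txt                 # or whatever the README specifies
python krr.py simple --allow_hpa
```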

> It's not possible at the moment. You need to check out the whole project and follow the from-source instructions here: https://github.com/robusta-dev/krr?tab=readme-ov-file#installation-methods

I'll wait for the new release.

patsevanton avatar Mar 16 '24 06:03 patsevanton

Maybe create an alpha release to test the HPA support?

patsevanton avatar Apr 22 '24 05:04 patsevanton

This is included in the latest release! Let me know if it works for you.

aantn avatar Apr 22 '24 17:04 aantn
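For reference, once you are on a release that includes this, enabling recommendations for HPA-managed workloads should just be a matter of passing the flag to the usual strategy (assuming the flag name stays --allow_hpa, as in the PR), along the lines of:

```
krr simple --allow_hpa
```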