Respect cooldownPeriod for the first deployment, and let the service come up and run based on the Deployment's replica count the first time.

Open · nuved opened this issue 1 year ago · 20 comments

Proposal

Hey. From my understanding of the current documentation, the cooldownPeriod in KEDA only takes effect after a scaling trigger has occurred. When a Deployment or StatefulSet is first deployed, KEDA immediately scales it to minReplicaCount, regardless of the cooldownPeriod.

It would be incredibly beneficial if the cooldownPeriod also applied when a resource is scaled for the first time. Specifically, upon deployment the resource would scale to the replicas defined in the Deployment or StatefulSet and respect the cooldownPeriod before any subsequent scaling operations.

Use-Case

This enhancement would give teams more predictable deployment behavior, especially during CI/CD processes. Ensuring that a new version of a service is stable upon deployment is critical, and this change would give teams more confidence during releases.

Is this a feature you are interested in implementing yourself?

No

Anything else?

No response

nuved · Sep 27 '23

Hello, during the CD process KEDA doesn't modify the workload. That said, IIRC you are right about the first-time deployment: KEDA doesn't take the cooldownPeriod into account then (for scaling to 0; it never applies when scaling to minReplicaCount).

Do you see this behaviour on every CD run? I mean, does this happen every time you deploy your workload? Is your workload scaled to 0 or to minReplicaCount? Could you share an example of your ScaledObject and also an example of your workload?

JorTurFer · Sep 27 '23

Hey,

This is my configuration for KEDA; the minimum replica count is set to 0.

spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
          - periodSeconds: 300
            type: Pods
            value: 1
          stabilizationWindowSeconds: 1800
        scaleUp:
          policies:
          - periodSeconds: 300
            type: Percent
            value: 100
          stabilizationWindowSeconds: 0
    restoreToOriginalReplicaCount: true
  cooldownPeriod: 1800
  fallback:
    failureThreshold: 3
    replicas: 1
  maxReplicaCount: 10
  minReplicaCount: 0
  pollingInterval: 20
  scaleTargetRef:
    name: test
  triggers:
  - authenticationRef:
      name: test
    metadata:
      mode: QueueLength
      protocol: amqp
      queueName: test
      value: "150"
    type: rabbitmq

On the other side, the replica count of the service's Deployment is set to 1. The Deployment also has liveness and readiness probes, and most of the time the service needs about 3 minutes to be up and ready.

This is the command our CD runs each time it deploys the service.

helm upgrade test ./ --install -f value.yaml -n test --set 'image.tag=test_6.0.0' --atomic --timeout 1200s

When using Helm with the --atomic flag, Helm expects the service to be up and its readiness/liveness probes to pass before marking the deployment as successful. However, with KEDA's minReplicaCount set to 0, our service is immediately scaled down to zero replicas, even before triggers are recognized.

This behavior leads Helm to assume the deployment was successful, while that's not necessarily true. In fact, the service was not up and running for even 20 seconds; it was killed by KEDA because the minimum replica count is set to 0.

I believe respecting the cooldownPeriod and using the Deployment's replica count when the service is deployed would be beneficial in these cases.

For the moment I have to set the minimum replica count to 1 to work around this issue.

nuved · Sep 28 '23

> On the other side, the replica count of the service's Deployment is set to 1. The Deployment also has liveness and readiness probes, and most of the time the service needs about 3 minutes to be up and ready.

Do you mean that your Helm chart always sets replicas: 1? Don't you have any condition to skip this setting? The Deployment manifest is idempotent; whatever you set there will be applied, at least for a few seconds. If you set 1, your workload will scale to 1 until the next HPA controller cycle.

As I said, this can happen the first time you deploy a ScaledObject, but it shouldn't on subsequent deployments; the reason it keeps happening could be that you are explicitly setting replicas in the Deployment manifest.

JorTurFer · Sep 28 '23
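
For context, a common way to avoid the explicit replicas setting is to render it conditionally in the chart, so the field is only emitted when KEDA is not managing the workload. This is only a sketch; the autoscaling.enabled and replicaCount value names are assumptions, not taken from the chart discussed here.

# Deployment template (sketch): omit replicas whenever KEDA/the HPA owns scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"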

Yes, replicas is set to 1 in the service's Deployment.
I even increased initialDelaySeconds to 300 for the liveness and readiness probes, so normally, when I set KEDA's minimum replica count to 1, Helm waits up to 300 seconds for confirmation that the service is up and running.

When I set the minimum replica count to 0, the service is shut down by KEDA after 5 seconds, and Helm reports that the service was deployed successfully, which isn't right!

kubectl describe ScaledObject -n test
Normal  KEDAScalersStarted          5s   keda-operator  Started scalers watch
Normal  ScaledObjectReady           5s   keda-operator  ScaledObject is ready for scaling
Normal  KEDAScaleTargetDeactivated  5s   keda-operator  Deactivated apps/v1.Deployment test/test from 1 to 0

And please note that the ScaledObject is applied by Helm alongside other resources such as the Deployment, Ingress, and Service.

Moreover, we use KEDA in our staging environments, which are not under load most of the time, so the queue is usually empty and the replica count sits at 0, which is fine! The issue arises when we deploy a new version: how can we make sure the service works well and isn't crashing when it is shut down by KEDA right away?

As a result, it would be great if KEDA used the Deployment's replica count as a base each time:

- With replicas: 1, maxReplicaCount: 10, minReplicaCount: 0, KEDA should set the replica count to 1 when the service is deployed.
- With replicas: 1, maxReplicaCount: 10, minReplicaCount: 5, KEDA should still set the replica count to 1 when the service is deployed.
- With replicas: 5, maxReplicaCount: 10, minReplicaCount: 5, KEDA can set the replica count to 5 when the service is re-deployed.

I can even set a time-based annotation on the ScaledObject (with Helm's help) so the ScaledObject gets updated on every deploy.

nuved · Sep 28 '23
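
A minimal sketch of that time-based annotation idea, assuming the ScaledObject is templated with Helm; the annotation key is purely illustrative:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test
  annotations:
    # Illustrative key: the value changes on every helm upgrade, so the ScaledObject is updated on each deploy
    example.com/deployed-at: '{{ now | date "20060102150405" }}'
spec:
  scaleTargetRef:
    name: test
  minReplicaCount: 0
  maxReplicaCount: 10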

I guess we could implement some initialDelay or something like that, but I'm still not sure why this happens after the first deployment. The first time it can happen, but after that I thought it shouldn't. Am I missing any important point @zroubalik?

JorTurFer · Oct 03 '23

Yeah, this is something we can add. Right now KEDA immediately scales to minReplicaCount if there's no load.

zroubalik · Nov 01 '23

+1.

We have exactly the same requirement. KEDA should have an initialDelay before starting to make scaling decisions. This is very helpful when you deploy something and need it immediately available. Then KEDA should scale things to idle/minimum if not used.

Imagine a deployment with Prometheus as the trigger (or any other pull-based trigger). The deployment is immediately scaled to zero, and only after the polling interval will it become available again.

pintonunes · Nov 22 '23
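
To make that pull-trigger scenario concrete, here is a sketch of a Prometheus-triggered ScaledObject; the name, server address, query, and threshold are placeholder values, not taken from this thread:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: http-app
spec:
  scaleTargetRef:
    name: http-app
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 1800
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{app="http-app"}[2m]))
        threshold: "100"
# Until the first query reports activity, KEDA sees no load and scales the freshly
# deployed workload straight down to minReplicaCount (0 here), which is the behavior
# described above.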

I agree that implementing a cooldown period for initial scaling in KEDA would be extremely beneficial, especially when using KEDA for serverless architectures. Having a cooldown after the first deployment, before the system is allowed to scale down to zero, would give the service a stabilization phase: it ensures the service runs smoothly post-deployment and lets teams assess the deployment's effectiveness before it scales to zero. This is particularly important for smooth and predictable scaling behavior in serverless environments.

helloxjade · Dec 05 '23

Maybe we can easily fix this by honoring cooldownPeriod in this case too. I think we check whether lastActive has a value, but we could just assign a default value. WDYT @kedacore/keda-core-contributors?

JorTurFer · Dec 06 '23

This is implementable, but probably as a new setting, to not break existing behavior?

zroubalik · Dec 07 '23

The bug has turned into a feature? xD Yep, we can use a new field for it.

JorTurFer · Dec 07 '23

Well, it has been there since the beginning 🤷‍♂️ 😄 I am open to discussion.

zroubalik · Dec 07 '23

The workaround we have in place right now, since we deploy ScaledObjects with an operator, is to not add idleReplicaCount while the ScaledObject's age (from creationTimestamp) is less than the cooldownPeriod. After that, we set idleReplicaCount to zero.

pintonunes · Dec 20 '23
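
A sketch of that operator workaround, under the assumption that an external controller manages the ScaledObject: idleReplicaCount is left out at creation time and only added once the object is older than cooldownPeriod.

# First pass: the ScaledObject as created at deploy time (no idleReplicaCount yet),
# so the workload keeps at least minReplicaCount replicas while it stabilizes.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test
spec:
  scaleTargetRef:
    name: test
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 1800
# Second pass: once now - creationTimestamp > cooldownPeriod, the operator patches in
#   idleReplicaCount: 0
# so the workload can be scaled to zero when idle.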

I have a problem. When I use a ScaledObject to manage a Deployment, the cooldownPeriod configured at creation time works normally, but after an update the modified cooldownPeriod no longer takes effect. The minimum replica count does not take effect when it is set to one, but it works when it is zero. #5321

528548004 · Dec 30 '23

@JorTurFer Does cooldownPeriod only take effect when minReplicaCount is equal to 0?

528548004 · Jan 02 '24

Yes, it only works when minReplicaCount or idleReplicaCount is zero.

JorTurFer · Jan 02 '24

@JorTurFer Hello, any plans for this fix? Thanks.

thincal · Jan 22 '24

I'm not sure there is consensus yet on how to fix it. @zroubalik, is a new field like initialDelay the way to go?

Once a solution is agreed on, anyone willing to contribute can help with the fix.

JorTurFer · Jan 22 '24

Yeah, a new field, maybe initialCooldownPeriod, to be consistent with the existing naming?

And maybe put it into the advanced section? Not sure.

zroubalik · Jan 29 '24
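
To illustrate the naming under discussion, here is a sketch of how the proposed field could look on a ScaledObject; the field name and its placement (top level vs. the advanced section) were still open at this point, so treat this as an assumption rather than the final API:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test
spec:
  scaleTargetRef:
    name: test
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 1800
  # Proposed: delay before the first scale-to-zero after the ScaledObject is created
  initialCooldownPeriod: 1800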

I support the proposal.

lgy1027 · Feb 02 '24

Is there any progress here? We really need this feature :) Thanks.

thincal · Mar 20 '24

The feature is almost ready; some small changes are pending (but KubeCon got in the way).

JorTurFer · Mar 25 '24