Keep HPA active when one of the triggers fails

Open Frezyy123 opened this issue 3 months ago • 4 comments

Proposal

When a ScaledObject uses multiple triggers, the failure of a single trigger causes the whole HPA to fail, which can leave the HPA stuck. For the Prometheus scaler, or any other external scaler, I would like to implement (if possible) a preventFailing option in the triggers section (the name is open for discussion :) ) that prevents the ScaledObject from becoming unready. It would be nice if, when our Prometheus cluster goes down, we could still scale on the CPU scaler.

I know there is a fallback option, but sometimes we need something more flexible than scaling to a static value. Fallback also doesn't work with all metric types.
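For context, the existing fallback mentioned above is configured on the ScaledObject spec roughly as follows. The numbers are illustrative and not taken from this issue:

```yaml
spec:
  fallback:
    failureThreshold: 3   # consecutive failures of a trigger before fallback applies
    replicas: 6           # static replica count used while the trigger keeps failing
```

This is the "static value" limitation: while a trigger keeps failing, the workload is pinned to `replicas` regardless of what the healthy triggers report.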

Use-Case

For example, we have a ScaledObject with these triggers and fallback disabled:

triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "50"
  - type: prometheus
    metadata:
      query: rate(my_beatuful_metric[1m])
      threshold: "3"

Let's imagine that Prometheus has network issues. The HPA then gets stuck because Prometheus is unreachable, and the ScaledObject stays in the status Ready: false, Active: false. So all triggers break because of a network issue that affects only one of them.

Is this a feature you are interested in implementing yourself?

Yes

Anything else?

Related issue: https://github.com/kedacore/keda/issues/4533. It looks like you once wanted to implement such an option, but that issue has gone stale.

Frezyy123 avatar Apr 02 '24 14:04 Frezyy123

Hello @Frezyy123, I think having more control over the fallback is an interesting scenario to dig into. I don't know how we can design this to be more flexible without making it too complex. Do you have any idea/proposal for the design (at a high level)?

JorTurFer avatar Apr 03 '24 20:04 JorTurFer

I assume it would be configured like this:

triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "50"
  - type: prometheus
    metadata:
      query: rate(my_beatuful_metric[1m])
      threshold: "3"
      skipFailing: true

The default is skipFailing: false (I guess skipFailing is a better name), because we don't want breaking changes, so if you don't enable this option everything works the same as before. With the option enabled, the controller just excludes the failing metric and scales on the CPU trigger only. When Prometheus is reachable again, the Prometheus trigger resumes working. I am not sure it will be easy to implement, so I need a bit more time for research.

Fallback will work the same as today: if at least one of the triggers without skipFailing: true fails, we set the fallback values for the HPA.

Frezyy123 avatar Apr 04 '24 10:04 Frezyy123

@tomkerkhove @zroubalik WDYT?

About the proposal, I'd move the skipFailing property from metadata to the trigger spec directly
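As a rough illustration of that suggestion, the property would sit next to `type` rather than under `metadata`. Note that `skipFailing` is only the name proposed in this thread, not an existing KEDA field, and the final name is still under discussion:

```yaml
triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "50"
  - type: prometheus
    skipFailing: true   # hypothetical field, moved from metadata to the trigger spec
    metadata:
      query: rate(my_beatuful_metric[1m])
      threshold: "3"
```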

JorTurFer avatar Apr 04 '24 20:04 JorTurFer

Yes, I think this is doable, we should move it out of the metadata section. I don't like the name though :)

zroubalik avatar Apr 08 '24 22:04 zroubalik