keda
Keep HPA active when one of the triggers fails
Proposal
When a ScaledObject uses multiple triggers, the failure of one trigger causes the whole HPA to fail. This can leave the HPA stuck, for example with the Prometheus scaler or any other external scaler.
If possible, I would like to implement a preventFailing option (the name is open to discussion :) ) in the triggers section that prevents the ScaledObject from becoming unready.
It would be nice if, when our Prometheus cluster goes down, we could still scale on the CPU scaler.
I know that there is a fallback option, but sometimes we need something more flexible than scaling to a static value. Also, it doesn't work with all types of metrics.
Use-Case
For example, we have a ScaledObject with these triggers and fallback disabled:
triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "50"
  - type: prometheus
    metadata:
      query: rate(my_beatuful_metric[1m])
      threshold: "3"
Let's imagine that Prometheus has network issues. The HPA then gets stuck because Prometheus is unreachable, and the ScaledObject stays with the statuses Ready: false, Active: false.
So all triggers are effectively down because a network issue affects one trigger.
Is this a feature you are interested in implementing yourself?
Yes
Anything else?
Related issue: https://github.com/kedacore/keda/issues/4533. It looks like you once wanted to implement such an option, but that issue is stale now.
Hello @Frezyy123, I think that having more control over the fallback is an interesting scenario to explore further. I don't know how we can design this to be more flexible without being too complex. Do you have any idea or proposal for the design (at a high level)?
I assume it would be configured like this:
triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "50"
  - type: prometheus
    metadata:
      query: rate(my_beatuful_metric[1m])
      threshold: "3"
      skipFailing: true
The default is skipFailing: false (I guess skipFailing is a better name), because we don't want breaking changes. If you don't enable this option, everything works the same as before.
So the controller just excludes the failing metric and scales on the cpu trigger only. When Prometheus is reachable again, the prometheus trigger resumes working.
I am not sure this will be easy to implement, so I need a little more time for research.
Fallback will work the same as today: if at least one trigger without skipFailing: true fails, we set the fallback values for the HPA.
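To make the intended semantics concrete, here is a rough Go sketch of the decision described above. It is not KEDA code: the Trigger and Decision types, the Evaluate function, and the ErrUnreachable value are all hypothetical names made up for illustration, and it assumes fallback is configured on the ScaledObject.

```go
package main

import "errors"

// ErrUnreachable is a hypothetical error standing in for a scaler that
// cannot reach its metrics backend (for example Prometheus being down).
var ErrUnreachable = errors.New("scaler unreachable")

// Trigger is a hypothetical, simplified view of one ScaledObject trigger.
type Trigger struct {
	Name        string
	SkipFailing bool  // proposed option; the name is still under discussion
	Err         error // non-nil when the last metric query failed
}

// Decision describes what the controller would do for one evaluation.
type Decision struct {
	Healthy     []string // triggers whose metrics are still used for scaling
	Skipped     []string // failing triggers excluded because SkipFailing is set
	UseFallback bool     // true when a failing trigger does not set SkipFailing
}

// Evaluate sorts triggers into healthy and skipped ones, and requests
// fallback as soon as one failing trigger is not marked SkipFailing.
func Evaluate(triggers []Trigger) Decision {
	var d Decision
	for _, t := range triggers {
		switch {
		case t.Err == nil:
			d.Healthy = append(d.Healthy, t.Name)
		case t.SkipFailing:
			d.Skipped = append(d.Skipped, t.Name)
		default:
			d.UseFallback = true
		}
	}
	return d
}
```

With the example above, a failing prometheus trigger that sets SkipFailing ends up in Skipped while cpu stays in Healthy and UseFallback remains false. With the default (SkipFailing false), the same failure sets UseFallback, which matches the current behaviour.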
@tomkerkhove @zroubalik WDYT?
About the proposal, I'd move the skipFailing property from metadata to the trigger spec directly.
Yes, I think this is doable, we should move it out of the metadata section. I don't like the name though :)