Increase operator resiliency to unexpected scaler failures

Open cyrilico opened this issue 3 months ago • 5 comments

Proposal

As an independent follow-up to #5619, I think it would be worth starting a discussion on improving the KEDA operator's resiliency, specifically against unexpected or catastrophic scaler failures. In the linked issue, a problematic query caused an outage in our KEDA operators, which prevented all ScaledObjects in the cluster from operating correctly until my team pinpointed the issue and removed that particular scaler configuration. While a quick fix for that specific scaler has been proposed, we worry that similar issues may arise in the future.

While we are not familiar enough with the codebase to immediately suggest potential paths, if you are able to provide some pointers and initial thoughts, we'd love to keep engaging and, if the opportunity arises, provide a contribution down the line 🙏

Use-Case

Higher operator resiliency to scaler failures

Is this a feature you are interested in implementing yourself?

Maybe

Anything else?

No response

cyrilico avatar Mar 25 '24 15:03 cyrilico

This is an interesting point. We aim to prevent panics in the code rather than just recovering from them, but maybe we could recover from panics on scaler metric requests: https://github.com/kedacore/keda/blob/1e1cfb11d6ca826d7c083e9aba730e08f3bd24f4/pkg/scaling/cache/scalers_cache.go#L125-L142

Scalers receive the most contributions, so they are also where the most unexpected problems appear. Although I think we should avoid panics, recovering from them could make sense in this case.

On the other hand, we have very few panics because we try to cover all the cases, and we almost manage to. WDYT @zroubalik @dttung2905?

JorTurFer avatar Mar 25 '24 22:03 JorTurFer

Yeah, we should avoid panics, I agree.

zroubalik avatar Apr 10 '24 16:04 zroubalik

Are you willing to open a PR with this recover, @cyrilico?

JorTurFer avatar Apr 10 '24 21:04 JorTurFer

I'll gladly take a shot at it whenever I get some time 🙏

cyrilico avatar Apr 11 '24 10:04 cyrilico