strimzi-kafka-operator

[Enhancement] Start using CC self heal features

Open brechtvalcke opened this issue 3 years ago • 24 comments

Hi,

I have been running a Kafka cluster with Strimzi for a while now. However, a few months ago we moved to preemptible VMs in Google Cloud. This means that Kafka brokers get terminated at least once a day.

If I could enable self-healing in Cruise Control, downtime could be reduced to unnoticeable proportions. However, I see that you filter that property in the documentation.

Why is that? The only other solution I see is building a service that manually detects a broker outage and then triggers a rebalance. However, this would not be needed if I could just enable this one property...
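For illustration, a minimal sketch of the decision logic such a service might use. It only compares expected and ready broker IDs and builds a KafkaRebalance manifest as a plain dict; how the ready brokers are discovered and how the manifest is applied to the cluster are left out. The function names, the resource name `outage-rebalance`, and the `kafka.strimzi.io/v1beta2` API version are assumptions for this sketch and may not match the Strimzi version in use.

```python
def brokers_missing(expected_ids, ready_ids):
    """Return the sorted broker IDs that are expected but not currently ready."""
    return sorted(set(expected_ids) - set(ready_ids))


def build_kafka_rebalance(cluster_name, name="outage-rebalance"):
    """Build a KafkaRebalance manifest (as a dict) for the given cluster.

    The apiVersion is an assumption; older Strimzi releases used other versions.
    """
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaRebalance",
        "metadata": {
            "name": name,
            # This label ties the resource to a specific Kafka cluster.
            "labels": {"strimzi.io/cluster": cluster_name},
        },
        # An empty spec asks for the default optimization goals.
        "spec": {},
    }


def plan_rebalance(cluster_name, expected_ids, ready_ids):
    """Return a KafkaRebalance manifest if any broker is missing, else None."""
    if not brokers_missing(expected_ids, ready_ids):
        return None
    return build_kafka_rebalance(cluster_name)
```

The returned dict could be serialized and applied with whatever Kubernetes client the service uses. The operator would still need to see an approval of the generated proposal before any partitions move.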

brechtvalcke avatar Sep 03 '20 15:09 brechtvalcke

We do plan to support self-healing at some point in the future. It is disabled for the initial CC integration as self-healing can make changes to the cluster that the Strimzi Cluster Operator may not be aware of and/or would have no control over.

For example, if CC's self-healing were to begin a large rebalance and then Strimzi decided (or was instructed) to roll the cluster, upgrade Kafka or perform some other disruptive operation there is no way (currently) for the CO to check with CC before proceeding or to signal CC to wait. At the moment, when we do a whole cluster rebalance, we are locking the CO from performing any other operations until the rebalance is complete so that we avoid this situation.

We could get the CO to check for ongoing executions with CC via the REST API or get CC to signal the CO via some form of notification/annotation. But either way we would need to perform a check before every disruptive activity that the CO may perform (there are a lot). This is not an impossible task; we just need to design it properly and make sure it fits with the rest of Strimzi's operations.
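As a rough sketch of the first option (not how the CO is implemented), a pre-check before a disruptive operation could ask CC for its executor state and proceed only when nothing is running. The endpoint path `/kafkacruisecontrol/state?substates=executor&json=true`, the `ExecutorState.state` field, and the idle value `NO_TASK_IN_PROGRESS` are assumptions about CC's REST API that should be verified against the deployed version. Authentication and TLS are omitted, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed name of the executor state that means "nothing is being executed".
IDLE_STATE = "NO_TASK_IN_PROGRESS"


def executor_is_idle(state_json):
    """Return True only if the JSON payload reports an idle executor."""
    payload = json.loads(state_json)
    if not isinstance(payload, dict):
        return False
    executor = payload.get("ExecutorState")
    return isinstance(executor, dict) and executor.get("state") == IDLE_STATE


def fetch_executor_state(base_url, timeout=5.0):
    """Fetch the executor substate from CC (assumed endpoint, no auth/TLS)."""
    url = f"{base_url}/kafkacruisecontrol/state?substates=executor&json=true"
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def safe_to_disrupt(base_url, fetch=fetch_executor_state):
    """Return True if no CC execution appears to be in progress.

    Any network or parsing failure counts as "not safe", so the caller
    waits instead of rolling brokers in the middle of a rebalance.
    """
    try:
        return executor_is_idle(fetch(base_url))
    except (OSError, ValueError):
        return False
```

Treating failures as "not safe" is a deliberate choice in this sketch: an unreachable CC delays the operation but cannot cause it to run concurrently with a rebalance.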

tomncooper avatar Sep 03 '20 16:09 tomncooper

tomncooper,

Thank you for the quick reply! I'll try to write something myself that takes care of my needs.

I understand why enabling it isn't an option right now. Is this already on the roadmap, or still in the distant future?

By the way, just a quick note: Strimzi is awesome! It's the only good way to run Kafka on Kubernetes. I struggled to set it up myself for a full year before I found your solution. It works like a charm!

brechtvalcke avatar Sep 04 '20 08:09 brechtvalcke

I understand why enabling it isn't an option right now. Is this already on the roadmap, or still in the distant future?

@tomncooper @ppatierno Is this something we plan to have long term?

scholzj avatar Sep 19 '20 21:09 scholzj

@tomncooper @ppatierno @kyguy ^^^? If we plan to enable some functionality around it in the future, we can change this to enhancement. If not, we should probably close it.

scholzj avatar Nov 16 '20 09:11 scholzj

I would say definitely yes in the long term. So I agree to change this to enhancement.

ppatierno avatar Nov 16 '20 09:11 ppatierno

Changing this to enhancement since it is in the long term plans. No exact timeline right now.

scholzj avatar Nov 16 '20 09:11 scholzj

tomncooper,

Thank you for the quick reply! I'll try to write something myself that takes care of my needs.

@brechtvalcke, did you end up writing something? If so, is it something you would care to share?

Thanks in advance.

DanielShor avatar Aug 02 '21 14:08 DanielShor

Hi,

We started to work on the migration of huge Kafka clusters to K8S using the strimzi operator. One of our must-have features is the cruise-control self-healing. Do you have an ETA for that?

liorfranko avatar Aug 09 '21 10:08 liorfranko

I do not think there is any ETA.

scholzj avatar Aug 09 '21 11:08 scholzj

Thanks

liorfranko avatar Aug 09 '21 13:08 liorfranko

Bumping this since self-healing is a pretty important feature of CC and has serious potential for support cost savings.

It looks like there's concern over Strimzi operations disrupting CC actions. But I'd assume that a cluster rebalance interrupted by a Strimzi roll would just result in CC trying another rebalance later to correct the cluster state anyhow, no?

And if the worry is not having perfect, reliable interactions between Strimzi and CC, then maybe some sort of compromise like an experimental: true flag would be in order to allow operators using Strimzi to unlock the full potential of the products being used, but at their own risk.

lee-mcfaul avatar May 06 '22 10:05 lee-mcfaul

Triaged on 26.5.2022: This is on the roadmap, but would require a proposal to clarify the interaction between Cruise Control and the operator.

scholzj avatar May 26 '22 14:05 scholzj

Also interested in this feature, also interested in implementing it as I consider it very crucial.

synchris avatar Aug 31 '22 06:08 synchris

tomncooper, Thank you for the quick reply! I'll try to write something myself that takes care of my needs.

brechtvalcke, did you end up writing something? If so, is it something you would care to share?

Thanks in advance.

@DanielShor I ended up just moving the cluster to regular nodes and not writing anything myself. However, now that Google has released Spot VMs, it might be less of an issue.

I have also noticed that we had heavily under-scaled our nodes. We had 3 very small nodes, which meant that the nodes didn't have enough CPU and RAM to handle the load balancing quickly.

brechtvalcke avatar Aug 31 '22 07:08 brechtvalcke

@synchris did you start looking into implementing this feature or opening a proposal?

katheris avatar Sep 26 '22 14:09 katheris

We should double-check with @kyguy as well, who seems to be the assignee for this.

ppatierno avatar Sep 26 '22 15:09 ppatierno

Hey @synchris, I was intending to take on this task but got sidetracked with some other work. Are you still interested in taking it on?

kyguy avatar Sep 27 '22 22:09 kyguy

I am interested in implementing this feature, as we want to use Cruise Control without switching to another operator.


synchris avatar Sep 28 '22 08:09 synchris

I am interested in implementing this feature, as we want to use Cruise Control without switching to another operator.

That is awesome, @synchris, I'll assign this task to you then! Since this is a complicated piece of work, it will require a Strimzi proposal [1], so I would start by fleshing that out. Feel free to hit me up on Slack if you have any questions or if I can assist in any other way!

[1] https://github.com/strimzi/proposals

kyguy avatar Sep 28 '22 16:09 kyguy

We are actually thinking about setting up a Strimzi community call focused on the status of the rebalancing support with Cruise Control in Strimzi. The goal is to review the current status (using the KafkaRebalance resource), take a quick look at the proposal for auto-rebalancing on scale up/down and a potential change to its current status, and then discuss how self-healing could have an impact and how much it conflicts with the other work in progress. @synchris, would you be willing to join if you would like to work on this? AFAICS, you are based in Europe, so I will pick a good time for EMEA while trying to cover people from the US as well (see @kyguy).

ppatierno avatar Sep 29 '22 08:09 ppatierno


Please schedule the meeting after 4:30 PM CEST if possible.

synchris avatar Oct 02 '22 17:10 synchris

Hi @synchris, thanks for your reply! I am actually preparing a quick presentation/slides. When ready I will set up a community call to discuss it.

ppatierno avatar Oct 03 '22 13:10 ppatierno

Hi @synchris, FYI I set up the call for next Thursday, October 13th, at 5:00 PM CEST. It's available on the public Strimzi Community Meetings calendar on Google. Let me know if you can see it and get the Zoom link; otherwise I will share it with you.

ppatierno avatar Oct 04 '22 08:10 ppatierno

Yesterday we had the planned community meeting about the Strimzi & Cruise Control integration. After I presented the current status and the planned improvements, we had a discussion mostly about the self-healing implementation and the ideas we would need to explore. @katheris, together with @mimaison, offered to investigate that part further. More details are available in the recording here: https://www.youtube.com/watch?v=kdUPK1zeei8

ppatierno avatar Oct 14 '22 08:10 ppatierno