strimzi-kafka-operator

[Enhancement]: Pass all nodes to KafkaRoller when doing rolling update

Open tinaselenge opened this issue 1 year ago • 3 comments

Related problem

Currently, KafkaReconciler passes all the nodes of the cluster to KafkaRoller, which evaluates them and rolls them if needed, when doing a rolling update here. However, when doing a manual rolling update, we pass only the subset of nodes that have the annotation applied here. This means that during a manual rolling update, we are not checking the health of the other pods that may be affected.

Suggested solution

I think we should always pass all the nodes to KafkaRoller and let it evaluate them based on their predicates. Change the code here so that we apply RestartReason.MANUAL_ROLLING_UPDATE only to the subset of nodes that need a manual rolling update, but pass the full set of nodes to KafkaRoller.
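A minimal sketch of the suggested change, assuming a simplified model rather than the real Strimzi classes. The class `ManualRollSketch`, the method `restartReasons`, the local `Reason` enum and the node names are all hypothetical and exist only for illustration. Only the reason name MANUAL_ROLLING_UPDATE comes from this issue. The idea is that every node ends up in the result, but only the annotated nodes carry the manual rolling update reason:

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model. These are NOT the real Strimzi classes or signatures.
class ManualRollSketch {
    // Stand-in for the restart reasons; only MANUAL_ROLLING_UPDATE is named in this issue.
    enum Reason { MANUAL_ROLLING_UPDATE }

    /**
     * Builds per-node restart reasons for the FULL set of nodes.
     * Annotated nodes get MANUAL_ROLLING_UPDATE; all other nodes get an empty set,
     * so a roller would still receive them but has no explicit reason to restart them.
     */
    static Map<String, Set<Reason>> restartReasons(List<String> allNodes, Set<String> annotatedNodes) {
        Map<String, Set<Reason>> reasons = new LinkedHashMap<>();
        for (String node : allNodes) {
            Set<Reason> nodeReasons = EnumSet.noneOf(Reason.class);
            if (annotatedNodes.contains(node)) {
                nodeReasons.add(Reason.MANUAL_ROLLING_UPDATE);
            }
            reasons.put(node, nodeReasons);
        }
        return reasons;
    }

    public static void main(String[] args) {
        // Assumed example node names; any identifiers would do.
        List<String> allNodes = List.of("my-cluster-kafka-0", "my-cluster-kafka-1", "my-cluster-kafka-2");
        Set<String> annotated = Set.of("my-cluster-kafka-1");

        // Every node is listed, but only the annotated one has a restart reason.
        restartReasons(allNodes, annotated)
                .forEach((node, r) -> System.out.println(node + " -> " + r));
    }
}
```

In this sketch the roller's input always covers the whole cluster, and the decision about which nodes to restart is carried by the reasons rather than by which nodes are passed in.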

Alternatives

No response

Additional context

No response

tinaselenge, Feb 16 '24 09:02

In what sense are we not evaluating them? The availability should be checked regardless which node is included and rolled.

scholzj, Feb 16 '24 09:02

> In what sense are we not evaluating them? The availability should be checked regardless which node is included and rolled.

The availability is checked, yes. KafkaRoller also checks whether there are unready or stuck nodes and tries to resolve those as well. However, when we are doing a manual rolling update, this check does not happen for the other pods.

tinaselenge, Feb 16 '24 09:02

I'm not sure that is an issue - I think that is intentional. The manual rolling update is a special request for rolling not done just because of a configuration change.

scholzj, Feb 16 '24 09:02

Triaged on the community call on 22.2.2024: @katheris suggests this might be a problem with how the controller quorum is checked in KRaft mode with dedicated controllers or mixed nodes. We should keep this in triage and @katheris and @tinaselenge will try to double-check if there is any issue with controller roles or not.

scholzj, Feb 22 '24 16:02

Triaged on the community call on 21.3.2024: @katheris and @tinaselenge were busy and will double-check if there is any issue with controller roles or not as soon as possible.

ppatierno, Mar 21 '24 16:03

After discussing with @katheris, we agreed that this is not necessary. We hit an issue setting up the admin client connection for KRaft controllers because of this, but we were able to work around it, so this is not a problem anymore.

tinaselenge, Mar 26 '24 13:03