strimzi-kafka-operator
[Enhancement]: Pass all nodes to KafkaRoller when doing rolling update
Related problem
Currently, KafkaReconciler passes all the nodes of the cluster to KafkaRoller, which evaluates them and rolls them if needed, when doing a rolling update here. However, when doing a manual rolling update, we pass only the subset of nodes that have the annotation applied here. This means that during a manual rolling update we are not checking the health of the other pods that may be affected.
Suggested solution
I think we should always pass all the nodes to KafkaRoller and let it evaluate them based on their predicates. Change the code here so that we apply RestartReason.MANUAL_ROLLING_UPDATE only to the subset of nodes that need a manual rolling update, but pass the full set of nodes to KafkaRoller.
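A minimal sketch of the idea, not the actual Strimzi code: the roller receives every node, and a predicate decides which ones carry a restart reason. Only `RestartReason.MANUAL_ROLLING_UPDATE` comes from this issue; the class `ManualRollSketch`, the methods `restartReasons` and `evaluate`, and the node names are hypothetical stand-ins used for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class ManualRollSketch {
    // Hypothetical stand-in for the operator's restart reasons;
    // only MANUAL_ROLLING_UPDATE is named in this issue.
    enum RestartReason { MANUAL_ROLLING_UPDATE }

    // Builds the per-node predicate: annotated nodes get MANUAL_ROLLING_UPDATE,
    // every other node gets an empty set of reasons.
    static Function<String, Set<RestartReason>> restartReasons(Set<String> annotatedNodes) {
        return node -> annotatedNodes.contains(node)
                ? Set.of(RestartReason.MANUAL_ROLLING_UPDATE)
                : Set.<RestartReason>of();
    }

    // Hypothetical stand-in for the roller: it visits every node it is given,
    // so nodes without a restart reason are still seen (and could be health-checked).
    static Map<String, Set<RestartReason>> evaluate(
            List<String> nodes, Function<String, Set<RestartReason>> predicate) {
        Map<String, Set<RestartReason>> result = new LinkedHashMap<>();
        for (String node : nodes) {
            result.put(node, predicate.apply(node));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed node names; only kafka-1 carries the manual rolling update annotation.
        List<String> allNodes = List.of("kafka-0", "kafka-1", "kafka-2");
        Set<String> annotated = Set.of("kafka-1");

        // Proposed behaviour: pass the full node list, not just the annotated subset.
        Map<String, Set<RestartReason>> evaluated =
                evaluate(allNodes, restartReasons(annotated));
        evaluated.forEach((node, reasons) ->
                System.out.println(node + " -> " + reasons));
    }
}
```

With the current behaviour, the equivalent call would pass only the annotated subset as the node list, so the other two nodes would never reach the roller's evaluation.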
Alternatives
No response
Additional context
No response
In what sense are we not evaluating them? The availability should be checked regardless which node is included and rolled.
The availability is checked, yes. KafkaRoller also checks whether there are unready or stuck nodes and tries to resolve those as well. However, if we are doing a manual rolling update, this check does not happen for the other pods.
I'm not sure that is an issue - I think that is intentional. The manual rolling update is a special request to roll specific nodes, not a roll triggered just by a configuration change.
Triaged on the community call on 22.2.2024: @katheris suggests this might be a problem with how the controller quorum is checked in KRaft mode with dedicated controllers or mixed nodes. We should keep this in triage and @katheris and @tinaselenge will try to double-check if there is any issue with controller roles or not.
Triaged on the community call on 21.3.2024: @katheris and @tinaselenge were busy and will double-check if there is any issue with controller roles or not as soon as possible.
After discussing with @katheris, we agreed that this change is not necessary. We hit an issue setting up the admin client connection for KRaft controllers because of this behaviour, but we were able to work around it, so it is not a problem anymore.