strimzi-kafka-operator
Add topic replication factor change
This change implements the proposal about topic replication factor change. https://github.com/strimzi/proposals/blob/main/066-topic-replication-factor-change.md
Documentation will be added in a separate PR.
To avoid making the review of the new components harder, there is a small amount of duplication related to the Cruise Control client. This will be addressed in a follow-up PR, where I plan to move the parts shared between the Rebalance and Topic operators into the operator-common module, and to switch from the Vert.x HTTP client to the Java HTTP client.
Unit tests are included, but I also performed a number of manual tests on Minikube and OpenShift that could help with testing the feature and with creating system tests (a kubectl sketch of the basic flow follows this list):
- Deleting the `$CLUSTER_NAME-cruise-control-api` secret should roll CC
- Deleting the `$CLUSTER_NAME-entity-topic-operator-cc-api` secret should roll both EO and CC
- Setting more replicas than available brokers should report an error in `.status.replicasChange.message`, and reverting the change should clear that status
- Setting `.spec.config."min.insync.replicas" < .spec.replicas` should work, logging a warning
- A bulk replicas change should take a reasonable amount of time (tested with 100 topics of 100 MB each; it took about 5-8 minutes on my local test environment)
- An ongoing replicas change should be recovered after a TO crash
- An ongoing replicas change should be recovered after a CC crash
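For reference, this is roughly the basic flow I used when testing by hand. It is only a sketch: it assumes a Kafka cluster named `my-cluster` with Cruise Control enabled and the Unidirectional Topic Operator in use, and the topic name and replica counts are arbitrary examples.

```sh
# Create a topic with 1 replica, then bump it to 3 to trigger a replication factor change.
kubectl apply -f - <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 3
  replicas: 1
EOF

kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":3}}'

# While Cruise Control is working on it, the change is reported under .status.replicasChange;
# the status block should be cleared once the change completes.
kubectl get kafkatopic my-topic -o jsonpath='{.status.replicasChange}'
```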
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
@scholzj your comments should be addressed. Thanks.
@tombentley @see-quick @ppatierno @kyguy can you have a look when you have time? Thanks.
One more thing ... I did not see anywhere in the code any check for UTO versus BTO. So, what is the expectation with regards to that?
Right, there were some checks, but most were missing. Fixed.
STs related
I was wondering about the STs and this is a possible approach:
Based on your input, Federico, I have prepared some possible test cases at the system level. I have not started with the implementation yet, but this is my idea of what we could implement:
- Test Cruise Control and Entity Operator Reaction to Secret Deletion
  - We can group both the CC and EO secret deletions and see the reaction to them. We expect that:
    - CC secret - CC restarts automatically, ensuring the system adapts to secret changes.
    - TO secret - both the Entity Operator (EO) and Cruise Control (CC) restart.
- Test Replicas Exceeding Broker Count
  - We already have a test case for this, so I think we would just need to add a check on the replicasChange status property when CC is enabled.
- Test Configuration of min.insync.replicas
  - I am not sure we want to cover this case, or whether we could easily do it at the IT level, WDYT? I mean, having such a test case just checking the warning would not be so great, I think. On the other hand, having a more comprehensive test case that also checks some potential risks in fault tolerance or data consistency is more at the Kafka level than in Strimzi, I think…
- Recovery test case
  - We could make a recovery scenario during an RF change where both CC and EO crash, and verify that (a command-line sketch follows below):
    - CC - after CC restarts, the TO should detect the event and re-initiate the replication factor change tasks, ensuring the system's ability to recover from failures.
    - TO - it should recover the ongoing replication factor change process without losing its state, ensuring resilience.
We can also include clients in some of the test cases but not sure how beneficial it would be... :)
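As a rough idea of how the CC-crash variant could be driven, a sketch along these lines; it assumes a cluster named `my-cluster`, an existing `my-topic`, and that the Cruise Control pod can be selected via the `strimzi.io/name` label (adjust the selector to whatever the ST framework uses):

```sh
# Trigger a replication factor change and crash Cruise Control while it is in flight.
kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":3}}'
kubectl delete pod -l strimzi.io/name=my-cluster-cruise-control

# After Cruise Control is back, the TO should re-detect the pending change and drive it
# to completion, clearing .status.replicasChange in the end.
kubectl get kafkatopic my-topic -o jsonpath='{.status.replicasChange}'
```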
Side note (performance and chaos possibilities):
Performance view:
As you did your specific test cases, I thought about it in a more general way to see how it behaves. We could now also start performance testing of the replicas change, i.e., how well the TO with CC is doing… One scenario which comes to my mind:
- Initiate a bulk replication factor change for `x` topics, where each KafkaTopic would have `y` MB/GB of data, and see how the system behaves and scales, e.g., if we increase `x`, `y`, or both of them, what the outcome would be (i.e., a linear, quadratic, exponential increase, etc.).
Chaos engineering view:
Simulate network partitions within the Kafka cluster that isolate some brokers; also include scenarios where we isolate a broker that is the leader for some partitions and is involved in the replication factor change. We can expect that the TO should handle such a scenario, and that Kafka would elect a new leader for the affected partitions…
Implementation details...
We could have a shared Kafka cluster to reduce execution time, but at the cost that if something crashes (i.e., the Kafka cluster becomes unresponsive because of some test case failure), all test cases would fail because of that one failure. I still think we may decide that we need dedicated Kafka clusters for each test case (especially in the recovery scenarios), but let's see what others think about it.
Hey @see-quick, thanks for working on the system test.
> Test Cruise Control and Entity Operator Reaction to Secret Deletion

These are not common use cases, so I would leave them last, or add in a separate PR?
> Test Replicas Exceeding Broker Count

Yes, specifically the .status.replicasChange.message and .status.replicasChange.state that should remain pending.
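For example, the extra check could be as simple as the following sketch, where the broker count (3) and the requested replicas (5) are just illustrative values:

```sh
# Request more replicas than there are brokers: the change cannot complete,
# so the state should remain pending and the message should report the error.
kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":5}}'
kubectl get kafkatopic my-topic \
  -o jsonpath='{.status.replicasChange.state}{"\n"}{.status.replicasChange.message}{"\n"}'

# Reverting the change should clear the replicasChange status again.
kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":3}}'
```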
> Test Configuration of min.insync.replicas:

I agree that we can add this one at the IT level. I'll look into that.
> Recovery test case

These must definitely be there.
> Initiate a bulk replication factor change for `x` topics where KafkaTopic would have MB/GB size.

It is great to also have performance tests, but we can have a dedicated class and PR for that, after the main ST is complete. Wdyt?
With bulk tests I was thinking more about the total number of topics with replicas changes (better if with some random data). We want all of them to be reconciled eventually.
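To make that concrete, a minimal sketch of what the bulk case could look like, assuming the topics (here 100 of them, each already containing some data) exist and follow a `my-topic-$i` naming scheme:

```sh
# Bump the replication factor on all pre-created topics in one go.
for i in $(seq 1 100); do
  kubectl patch kafkatopic "my-topic-$i" --type merge -p '{"spec":{"replicas":3}}'
done

# Eventually every topic should be reconciled and become Ready again,
# with no leftover .status.replicasChange.
kubectl wait kafkatopic --all --for=condition=Ready --timeout=600s
```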
> We can expect that TO should handle such a scenario that Kafka would elect a new leader for affected partitions…

This is not handled by the TO, but by Cruise Control, so I don't think we have to test it.
> These are not common use cases, so I would leave them last, or add in a separate PR?

Yeah sure.

> Yes, specifically the .status.replicasChange.message and .status.replicasChange.state that should remain pending.

Cool, that should be fairly simple.

> I agree that we can add this one at the IT level. I'll look into that.

Thanks.

> These must definitely be there.

Will do.

> It is great to also have performance tests, but we can have a dedicated class and PR for that, after the main ST is complete. Wdyt?

Agree, it was not my intention to have these now, just food for thought.

> With bulk tests I was thinking more about the total number of topics with replicas changes (better if with some random data). We want all of them to be reconciled eventually.

So do we want to have this automated in scope of the ST?

> This is not handled by the TO, but by Cruise Control, so I don't think we have to test it.

I see, thanks.
> So do we want to have this automated in scope of ST?

A dedicated PR for the ST is fine. I was just suggesting a third PR for performance tests.
Keep in mind that performance and scale tests don't work well on our infra. Not sure if QE is running them somewhere outside of Azure, but we should probably not write something that we will never run. So it should be thought through.
> Keep in mind that performance and scale tests don't work well on our infra. Not sure if QE is running them somewhere outside of Azure, but we should probably not write something that we will never run. So it should be thought through.

Yeah, I think a proposal for such tests (e.g., performance) would be a good start to answer questions and more...
> A dedicated PR for the ST is fine. I was just suggesting a third PR for performance tests.

Totally agree :)
@see-quick found a corner case while creating the ST. Thanks!
In short, the change status was not removed after CC restart when the change was completed, but not yet detected by the TO, and there were no further topic changes.
This is fixed by the last commit.
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
/azp run feature-gates-regression
Azure Pipelines successfully started running 1 pipeline(s).
Still working on the corner case fix, as the first attempt was not successful.
The ST that is triggering the issue is `testRecoveryOfReplicationChangeDuringCcCrash`.
With the fix I'm working on right now, STs are working consistently over multiple runs. I'll make sure to add a unit test before pushing.
The issue should be fixed.
@scholzj can you launch regression tests when you have time?
@see-quick images on my own registry are aligned if you want to do another test.
Thanks.
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
> The issue should be fixed.
> @scholzj can you launch regression tests when you have time?
> @see-quick images on my own registry are aligned if you want to do another test.
> Thanks.

Everything is working fine thanks Fede.
/azp run kraft-regression
Azure Pipelines successfully started running 1 pipeline(s).
/azp run feature-gates-regression
Azure Pipelines successfully started running 1 pipeline(s).