strimzi-kafka-operator
Add topic replication factor change
This change implements the proposal about topic replication factor change. https://github.com/strimzi/proposals/blob/main/066-topic-replication-factor-change.md
Documentation will be added in a separate PR.
To avoid making the review of the new components harder, there is a small amount of duplication related to the Cruise Control client. This will be addressed in a follow-up PR, where I plan to move the parts shared between the Rebalance and Topic operators into the operator-common module, and to switch from the Vert.x HTTP client to the Java HTTP client.
Unit tests are included, but I also performed a number of manual tests on Minikube and OpenShift that could help with testing the feature and with creating system tests (a kubectl sketch of the basic flow follows this list):
- Deleting the `$CLUSTER_NAME-cruise-control-api` secret should roll CC
- Deleting the `$CLUSTER_NAME-entity-topic-operator-cc-api` secret should roll both EO and CC
- Setting more replicas than available brokers should report an error in `.status.replicasChange.message`, and reverting the change should clear that status
- Setting `.spec.config."min.insync.replicas" < .spec.replicas` should work, logging a warning
- A bulk replicas change should take a reasonable amount of time (tested with 100 topics of 100 MB each; it took about 5-8 minutes on my local test environment)
- An ongoing replicas change should be recovered after a TO crash
- An ongoing replicas change should be recovered after a CC crash
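For reference, this is roughly the basic flow I used when testing by hand. It is only a sketch: it assumes a Kafka cluster named `my-cluster` with Cruise Control enabled and the Unidirectional Topic Operator in use, and the topic name and replica counts are arbitrary examples.

```sh
# Create a topic with 1 replica, then bump it to 3 to trigger a replication factor change.
kubectl apply -f - <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 3
  replicas: 1
EOF

kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":3}}'

# While Cruise Control is working on it, the change is reported under .status.replicasChange;
# the status block should be cleared once the change completes.
kubectl get kafkatopic my-topic -o jsonpath='{.status.replicasChange}'
```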
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
@scholzj your comments should be addressed. Thanks.
@tombentley @see-quick @ppatierno @kyguy can you have a look when you have time? Thanks.
One more thing ... I did not see anywhere in the code any check for UTO versus BTO. So, what is the expectation with regards to that?
Right, there were some checks, but most were missing. Fixed.
STs related
I was wondering about the STs and this is a possible approach:
Based on your input, Federico, I have prepared some possible test cases at the system level. I have not started with the implementation yet, but this is my idea of what we could implement:
- Test Cruise Control and Entity Operator Reaction to Secret Deletion
  - We can group both the CC and EO secret deletions and see the reaction to them. We expect that:
    - CC secret - CC restarts automatically, ensuring the system adapts to secret changes.
    - TO secret - both the Entity Operator (EO) and Cruise Control (CC) restart.
- Test Replicas Exceeding Broker Count
  - We already have a test case for this, so I think we would just need to add a check on the replicasChange status property when CC is enabled.
- Test Configuration of min.insync.replicas
  - I am not sure we want to cover this case, or whether we could easily do it at the IT level, WDYT? I mean, having such a test case just checking the warning would not be so great, I think. On the other hand, having a more comprehensive test case that also checks some potential risks in fault tolerance or data consistency is more at the Kafka level than in Strimzi, I think…
- Recovery test case
  - We could make a recovery scenario during an RF change where both CC and EO crash, and verify that (a command-line sketch follows below):
    - CC - after CC restarts, the TO should detect the event and re-initiate the replication factor change tasks, ensuring the system's ability to recover from failures.
    - TO - it should recover the ongoing replication factor change process without losing its state, ensuring resilience.
We can also include clients in some of the test cases but not sure how beneficial it would be... :)
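As a rough idea of how the CC-crash variant could be driven, a sketch along these lines; it assumes a cluster named `my-cluster`, an existing `my-topic`, and that the Cruise Control pod can be selected via the `strimzi.io/name` label (adjust the selector to whatever the ST framework uses):

```sh
# Trigger a replication factor change and crash Cruise Control while it is in flight.
kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":3}}'
kubectl delete pod -l strimzi.io/name=my-cluster-cruise-control

# After Cruise Control is back, the TO should re-detect the pending change and drive it
# to completion, clearing .status.replicasChange in the end.
kubectl get kafkatopic my-topic -o jsonpath='{.status.replicasChange}'
```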
Side note (performance and chaos possibilities):
Performance view:
As you did your specific test cases, I thought about it in a more general way to see how it behaves. We could now also start performance testing of the replicas change, i.e., how well the TO with CC is doing… One scenario which comes to my mind:
- Initiate a bulk replication factor change for `x` topics, where each KafkaTopic would have `y` MB/GB of data, and see how the system behaves and scales, e.g., if we increase `x`, `y`, or both of them, what the outcome would be (i.e., a linear, quadratic, exponential increase, etc.).
Chaos engineering view:
Simulate network partitions within the Kafka cluster that isolate some brokers; also include scenarios where we isolate a broker that is the leader for some partitions and is involved in the replication factor change. We can expect that the TO should handle such a scenario, and that Kafka would elect a new leader for the affected partitions…
Implementation details...
We could have a shared Kafka cluster to reduce execution time, but at the cost that if something crashes (i.e., the Kafka cluster becomes unresponsive because of some test case failure), all test cases would fail because of that one failure. I still think we may decide that we need dedicated Kafka clusters for each test case (especially in the recovery scenarios), but let's see what others think about it.
Hey @see-quick, thanks for working on the system test.
> Test Cruise Control and Entity Operator Reaction to Secret Deletion

These are not common use cases, so I would leave them last, or add in a separate PR?
> Test Replicas Exceeding Broker Count

Yes, specifically the .status.replicasChange.message and .status.replicasChange.state that should remain pending.
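For example, the extra check could be as simple as the following sketch, where the broker count (3) and the requested replicas (5) are just illustrative values:

```sh
# Request more replicas than there are brokers: the change cannot complete,
# so the state should remain pending and the message should report the error.
kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":5}}'
kubectl get kafkatopic my-topic \
  -o jsonpath='{.status.replicasChange.state}{"\n"}{.status.replicasChange.message}{"\n"}'

# Reverting the change should clear the replicasChange status again.
kubectl patch kafkatopic my-topic --type merge -p '{"spec":{"replicas":3}}'
```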
> Test Configuration of min.insync.replicas:

I agree that we can add this one at the IT level. I'll look into that.
> Recovery test case

These must definitely be there.
> Initiate a bulk replication factor change for `x` topics where KafkaTopic would have MB/GB size.

It is great to also have performance tests, but we can have a dedicated class and PR for that, after the main ST is complete. Wdyt?
With bulk tests I was thinking more about the total number of topics with replicas changes (better if with some random data). We want all of them to be reconciled eventually.
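To make that concrete, a minimal sketch of what the bulk case could look like, assuming the topics (here 100 of them, each already containing some data) exist and follow a `my-topic-$i` naming scheme:

```sh
# Bump the replication factor on all pre-created topics in one go.
for i in $(seq 1 100); do
  kubectl patch kafkatopic "my-topic-$i" --type merge -p '{"spec":{"replicas":3}}'
done

# Eventually every topic should be reconciled and become Ready again,
# with no leftover .status.replicasChange.
kubectl wait kafkatopic --all --for=condition=Ready --timeout=600s
```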
> We can expect that TO should handle such a scenario that Kafka would elect a new leader for affected partitions…

This is not handled by the TO, but by Cruise Control, so I don't think we have to test it.
> These are not common use cases, so I would leave them last, or add in a separate PR?

Yeah sure.

> Yes, specifically the .status.replicasChange.message and .status.replicasChange.state that should remain pending.

Cool, that should be fairly simple.

> I agree that we can add this one at the IT level. I'll look into that.

Thanks.

> These must definitely be there.

Will do.

> It is great to also have performance tests, but we can have a dedicated class and PR for that, after the main ST is complete. Wdyt?

Agree, it was not my intention to have these now, just food for thought.

> With bulk tests I was thinking more about the total number of topics with replicas changes (better if with some random data). We want all of them to be reconciled eventually.

So do we want to have this automated in scope of the ST?

> This is not handled by the TO, but by Cruise Control, so I don't think we have to test it.

I see, thanks.
> So do we want to have this automated in scope of ST?

A dedicated PR for the ST is fine. I was just suggesting a third PR for performance tests.
Keep in mind that performance and scale tests don't work well on our infra. Not sure if QE is running them somewhere outside of Azure, but we should probably not write something that we will never run. So it should be thought through.
> Keep in mind that performance and scale tests don't work well on our infra. Not sure if QE is running them somewhere outside of Azure, but we should probably not write something that we will never run. So it should be thought through.

Yeah, I think a proposal for such tests (e.g., performance) would be a good start to answer questions and more...
> A dedicated PR for the ST is fine. I was just suggesting a third PR for performance tests.

Totally agree :)
@see-quick found a corner case while creating the ST. Thanks!
In short, the change status was not removed after CC restart when the change was completed, but not yet detected by the TO, and there were no further topic changes.
This is fixed by the last commit.
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
/azp run feature-gates-regression
Azure Pipelines successfully started running 1 pipeline(s).
Still working on the corner case fix, as the first attempt was not successful.
The ST that is triggering the issue is `testRecoveryOfReplicationChangeDuringCcCrash`.
With the fix I'm working on right now, STs are working consistently over multiple runs. I'll make sure to add a unit test before pushing.
The issue should be fixed.
@scholzj can you launch regression tests when you have time?
@see-quick images on my own registry are aligned if you want to do another test.
Thanks.
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
> The issue should be fixed.
> @scholzj can you launch regression tests when you have time?
> @see-quick images on my own registry are aligned if you want to do another test.
> Thanks.

Everything is working fine thanks Fede.
/azp run kraft-regression
Azure Pipelines successfully started running 1 pipeline(s).
/azp run feature-gates-regression
Azure Pipelines successfully started running 1 pipeline(s).