zookeeper-operator icon indicating copy to clipboard operation
zookeeper-operator copied to clipboard

Check orphan PVC before updating statefulSet

Open hoyhbx opened this issue 2 years ago • 13 comments

Change log description

  • Check orphan PVCs are deleted before updating the statefulSet

Purpose of the change

Fixes #513

What the code does

(Detailed description of the code changes) If the ZooKeeper cluster is using Persistence storage and the reclaimPolicy is set to Delete, this change makes sure that the number of PVCs is not larger than the statefulSet replicas before updating the statefulSet.

How to verify it

The bug #513 cannot be reproduced after this commit. E2E test is added to prevent regression

hoyhbx avatar Feb 05 '23 21:02 hoyhbx

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (2c8bfec) 85.91% compared to head (257813a) 85.41%.

Files Patch % Lines
controllers/zookeepercluster_controller.go 90.62% 2 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #526      +/-   ##
==========================================
- Coverage   85.91%   85.41%   -0.51%     
==========================================
  Files          12       12              
  Lines        1633     1659      +26     
==========================================
+ Hits         1403     1417      +14     
- Misses        145      156      +11     
- Partials       85       86       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 05 '23 21:02 codecov[bot]

@hoyhbx Could you please increase the code coverage. Also the DCO check is failing

anishakj avatar Feb 28 '23 14:02 anishakj

I will try to write a system test to cover these two branches

hoyhbx avatar Mar 17 '23 03:03 hoyhbx

@anishakj, I wrote an e2e test to reproduce the issue. During the process, I realize the previous fix does not work. I want to understand the code better before implementing the fix.

I want to confirm my understand for the usage of ZooKeeper.Status.Replicas and ZooKeeper.Status.ReadyReplicas before I implement the fix. The two fields seem to be used to reflect the StatefulSet's Status.Replicas and Status.ReadyReplicas fields, but they are only updated in this function: https://github.com/pravega/zookeeper-operator/blob/28d1f6917d2284c5673503816e110cf25555bd49/controllers/zookeepercluster_controller.go#L251

I want to implement a fix to make the reconcileStatefulSet function to return early if there are still orphan PVCs, by comparing the ZooKeeper.Spec.Replicas to ZooKeeper.Status.Replicas.
The orphanPVC cleanup should be triggered by comparing against ZooKeeper.Status.Replicas instead of ZooKeeper.Spec.Replicas here: https://github.com/pravega/zookeeper-operator/blob/28d1f6917d2284c5673503816e110cf25555bd49/controllers/zookeepercluster_controller.go#L717

But this doesn't work if the ZooKeeper.Status.Replicas is updated at the end of the reconcileStatefulSet function, because it would be skipped along with the StatefulSet update. I propose to update the ZooKeeper.Status.Replicas and ZooKeeper.Status.Replicas every reconcilation cycle, by either moving it earlier in the reconcileStatefulSet function, or move it to the reconcileClusterStatus function.

hoyhbx avatar Mar 18 '23 22:03 hoyhbx

Please check the latest commit for the tentative fix. I moved the ZooKeeper.Status.Replicas and ZooKeeper.Status.ReadyReplicas update to the reconcileClusterStatus. For cleaning up orphanPVC, the fix gets the up to date StatefulSet to check for the need of cleaning up. It checks for the STS.Spec.Replicas against PVC count, because I found that in some corner cases STS.Status.Replicas may not up to date, causing it to delete the first PVC when starting the StatefulSet initially.

The e2e test added can reproduce the issue without the patch, and it passes after the patch

hoyhbx avatar Mar 19 '23 00:03 hoyhbx

@hoyhbx Could you please increase the code coverage for this?

anishakj avatar Apr 11 '23 02:04 anishakj

@anishakj I added unittests to test the PVC deletion logics and fixed the e2e test

hoyhbx avatar Apr 11 '23 18:04 hoyhbx

Hi @anishakj , have you gotten a chance to look at the changes? We are happy to make improvements if you have some suggestions.

hoyhbx avatar Apr 27 '23 07:04 hoyhbx

Hi @anishakj , have you gotten a chance to look at the changes? We are happy to make improvements if you have some suggestions.

will perform some more tests from my side and let you know

anishakj avatar Apr 27 '23 07:04 anishakj

Thanks @anishakj ! Just let us know if there is anything we could improve. We are also more than happy to fix the issue in the zookeeperStart.sh mentioned here: https://github.com/pravega/zookeeper-operator/issues/513#issuecomment-1502246369

hoyhbx avatar Apr 27 '23 07:04 hoyhbx

Hi @anishakj , did you encounter any problem when testing this PR? Is there anything we can help?

hoyhbx avatar May 11 '23 17:05 hoyhbx

@anishakj - Is this part of the release v0.2.15. I don't see that as part of the change log. If not, any plans to include it in the new release? Thanks

iampranabroy avatar Nov 02 '23 04:11 iampranabroy

@anishakj - Is this part of the release v0.2.15. I don't see that as part of the change log. If not, any plans to include it in the new release? Thanks

@hoyhbx Could you please increase the coverage

anishakj avatar Dec 14 '23 08:12 anishakj