karmada
fix WorkStatusController, reconcile work status when member cluster becomes ready again
What type of PR is this?
/kind bug
What this PR does / why we need it:
When I use Karmada to manage a cluster (member1), Karmada may be unable to reach member1 for some time because of a network problem. When the network is restored, member1 becomes Ready again in Karmada, but the status of resources in the member cluster is not synchronized back to Karmada in time.
This PR tries to fix that by making the WorkStatusController aware of cluster status.
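The general idea, as a minimal sketch rather than the exact code in this change (the package name, the onClusterReadyChanged and mapClusterToWorkRequests helpers, and the hard-coded karmada-es- prefix are all illustrative assumptions, and a recent controller-runtime is assumed): watch Cluster objects, react only when the Ready condition flips, and re-enqueue that cluster's Works so their informers get rebuilt.

// Illustrative sketch only, not the code in this PR.
package workstatus

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
	workv1alpha1 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha1"
)

// onClusterReadyChanged lets an update through only when the Ready condition
// actually changes, so routine cluster status heartbeats do not re-enqueue Works.
func onClusterReadyChanged() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldCluster, okOld := e.ObjectOld.(*clusterv1alpha1.Cluster)
			newCluster, okNew := e.ObjectNew.(*clusterv1alpha1.Cluster)
			if !okOld || !okNew {
				return false
			}
			return meta.IsStatusConditionTrue(oldCluster.Status.Conditions, clusterv1alpha1.ClusterConditionReady) !=
				meta.IsStatusConditionTrue(newCluster.Status.Conditions, clusterv1alpha1.ClusterConditionReady)
		},
		CreateFunc:  func(event.CreateEvent) bool { return false },
		DeleteFunc:  func(event.DeleteEvent) bool { return false },
		GenericFunc: func(event.GenericEvent) bool { return false },
	}
}

// mapClusterToWorkRequests turns a Cluster event into reconcile requests for
// every Work in that cluster's execution namespace, so reconciling them
// rebuilds the informers and re-collects status.
func mapClusterToWorkRequests(c client.Client) func(context.Context, client.Object) []reconcile.Request {
	return func(ctx context.Context, obj client.Object) []reconcile.Request {
		cluster, ok := obj.(*clusterv1alpha1.Cluster)
		if !ok {
			return nil
		}
		// Execution namespace convention assumed here; real code would use the
		// helper in pkg/util/names instead of hard-coding the prefix.
		executionNS := "karmada-es-" + cluster.Name
		workList := &workv1alpha1.WorkList{}
		if err := c.List(ctx, workList, client.InNamespace(executionNS)); err != nil {
			return nil
		}
		requests := make([]reconcile.Request, 0, len(workList.Items))
		for i := range workList.Items {
			requests = append(requests, reconcile.Request{NamespacedName: types.NamespacedName{
				Namespace: workList.Items[i].Namespace,
				Name:      workList.Items[i].Name,
			}})
		}
		return requests
	}
}

These two pieces would be wired into the controller builder with Watches plus WithPredicates, so only a real Ready transition of a single cluster triggers the extra reconciles.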
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Welcome @caden2016! It looks like this is your first PR to karmada-io/karmada 🎉
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
Attention: Patch coverage is 12.90323% with 27 lines in your changes missing coverage. Please review.
Project coverage is 53.23%. Comparing base (37dd12f) to head (e2077b4). Report is 1021 commits behind head on master.
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@ Coverage Diff @@
## master #4991 +/- ##
==========================================
- Coverage 53.32% 53.23% -0.09%
==========================================
Files 253 253
Lines 20560 20590 +30
==========================================
- Hits 10963 10962 -1
- Misses 8871 8901 +30
- Partials 726 727 +1
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 53.23% <12.90%> (-0.09%) :arrow_down: | |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
You need to sign the DCO.
https://github.com/karmada-io/karmada/pull/4991/checks?check_run_id=25487819915
I'm not sure if the failing cases are related, just give it another try. /retest
I don't know why e2e failed. Can you help me check if it is related to the code I changed?
Hi @caden2016
but the status of resources in the member cluster is not synchronized back to Karmada in time.
How did you test this? It seems to me that the cluster informer is not currently deleted after the cluster becomes NotReady, and the event handler registered on the cluster informer will still work when the cluster network is restored.
I did not test with the newest version. I made the network between Karmada and the member cluster unreachable, changed the resource status in member1, and found that the resource status in Karmada remained unchanged. But when the network was restored, the status was reconciled to Karmada in time. What I changed is making the WorkStatusController aware of cluster status: it reconciles all Works, starting the informers.
I found that the version I tested has code like the following. Does it matter?
func (c *ClusterStatusController) syncClusterStatus(cluster *clusterv1alpha1.Cluster) (controllerruntime.Result, error) {
...
if !online && readyCondition.Status != metav1.ConditionTrue {
klog.V(2).Infof("Cluster(%s) still offline after %s, ensuring offline is set.",
cluster.Name, c.ClusterFailureThreshold.Duration)
c.InformerManager.Stop(cluster.Name)
setTransitionTime(cluster.Status.Conditions, readyCondition)
	meta.SetStatusCondition(&currentClusterStatus.Conditions, *readyCondition)
return c.updateStatusIfNeeded(cluster, currentClusterStatus)
}
...
}
But when the network was restored, the status was reconciled to Karmada in time.
Is this what happens after you modify it?
What I changed is making the WorkStatusController aware of cluster status: it reconciles all Works, starting the informers.
There may be a lot of Works, which could affect the performance of the controller.
Is this what happens after you modify it?
yes
There may be a lot of Works, which could affect the performance of the controller.
Maybe a lot of resource informers will start, but I think it is suitable for the WorkStatusController to do this.
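To put a rough bound on that, here is a minimal standalone sketch assuming only the standard client-go workqueue semantics that controllers are built on (the keys below are made up): the burst is limited to the Works of the one cluster that flipped to Ready, and keys that are already pending in the queue are deduplicated.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Standard controller work queue: adding a key that is already pending is a no-op.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	// Pretend these are the Works re-enqueued when member1 becomes Ready again.
	for _, key := range []string{
		"karmada-es-member1/work-a",
		"karmada-es-member1/work-a", // duplicate of a pending key, collapsed
		"karmada-es-member1/work-b",
	} {
		q.Add(key)
	}
	fmt.Println(q.Len()) // prints 2, not 3
}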
How did you test this? It seems to me that the cluster informer is not currently deleted after the cluster becomes NotReady, and the event handler registered on the cluster informer will still work when the cluster network is restored.
I think the informer must be deleted to avoid lots of informer reconnection attempts, like 25df726, so my change is what restores things afterwards. Why the e2e errors? Are some of them caused by waiting for resources to disappear on member clusters?
Is this what happens after you modify it?
yes
What was the behavior before the change?
I don't know about the newest code. I tried with the old version v1.2.0, and it does not work.
I think even if master adds back 25df726 it still will not work, since nothing restarts the informer.
The reason to add back stopping informers in the ClusterStatusController is that I think otherwise there would be lots of informer reconnection attempts while the network is unreachable.
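For anyone following along, here is a minimal standalone sketch of that reconnect behavior using a plain dynamic shared informer factory (the kubeconfig path and GVR are just examples, not karmada code): once started, each informer's reflector keeps retrying ListAndWatch with backoff for as long as its stop channel stays open, even while the member API server is unreachable, which is roughly what InformerManager.Stop(cluster.Name) in the snippet quoted earlier is meant to cut off.

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Example client against a member cluster kubeconfig (illustrative path).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/member1.kubeconfig")
	if err != nil {
		panic(err)
	}
	dynClient, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	stopCh := make(chan struct{})
	factory := dynamicinformer.NewDynamicSharedInformerFactory(dynClient, 0)

	// Starting an informer for a resource spins up a reflector that retries
	// ListAndWatch (with backoff) for as long as stopCh stays open, even if
	// the member cluster is unreachable.
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
	factory.ForResource(gvr).Informer()
	factory.Start(stopCh)

	time.Sleep(10 * time.Second) // the reflector keeps retrying during this window

	// Closing the stop channel is what actually ends the retries.
	close(stopCh)
}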
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign garrybest, lonelycz for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest
@caden2016: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
In response to this:
/retest
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
/retest
Full Stack Trace
------------------------------
[SynchronizedAfterSuite] PASSED [0.004 seconds]
[SynchronizedAfterSuite]
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:170
------------------------------
• [FAILED] [435.094 seconds]
[BasicPropagation] propagation testing Service propagation testing [DeferCleanup (Each)] service propagation testing
[DeferCleanup (Each)] /home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:131
[It] /home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:138
Captured StdOut/StdErr Output >>
I0604 15:30:04.853436 47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member1)
I0604 15:30:09.860442 47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member2)
I0604 15:30:09.863970 47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member3)
I0604 15:30:09.871770 47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member1)
I0604 15:30:14.878639 47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member2)
I0604 15:30:14.881384 47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member3)
I0604 15:30:14.899192 47187 service.go:79] Waiting for service(karmadatest-97mbg/service-rgnqm) disappear on cluster(member1)
I0604 15:30:19.904891 47187 service.go:79] Waiting for service(karmadatest-97mbg/service-rgnqm) disappear on cluster(member2)
I0604 15:30:19.906885 47187 service.go:79] Waiting for service(karmadatest-97mbg/service-rgnqm) disappear on cluster(member3)
<< Captured StdOut/StdErr Output
Timeline >>
STEP: Creating PropagationPolicy(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:04.815
STEP: Creating Service(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:04.831
STEP: Waiting for service(karmadatest-97mbg/service-rgnqm) synced on member clusters @ 06/04/24 15:30:04.853
STEP: Updating service(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:09.867
STEP: Waiting for service(karmadatest-97mbg/service-rgnqm) synced on member clusters @ 06/04/24 15:30:09.871
STEP: Removing PropagationPolicy(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:14.884
STEP: Removing Service(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:14.889
STEP: Check if service(karmadatest-97mbg/service-rgnqm) diappeare on member clusters @ 06/04/24 15:30:14.899
[FAILED] in [DeferCleanup (Each)] - /home/runner/work/karmada/karmada/test/e2e/framework/service.go:91 @ 06/04/24 15:37:19.908
<< Timeline
[FAILED] Timed out after 420.001s.
Expected
<bool>: false
to equal
<bool>: true
In [DeferCleanup (Each)] at: /home/runner/work/karmada/karmada/test/e2e/framework/service.go:91 @ 06/04/24 15:37:19.908
Full Stack Trace
github.com/karmada-io/karmada/test/e2e/framework.WaitServiceDisappearOnCluster({0xc000359ae9, 0x7}, {0xc000161578, 0x11}, {0xc000c87f83, 0xd})
/home/runner/work/karmada/karmada/test/e2e/framework/service.go:91 +0x614
github.com/karmada-io/karmada/test/e2e/framework.WaitServiceDisappearOnClusters.func1()
/home/runner/work/karmada/karmada/test/e2e/framework/service.go:98 +0xaa
github.com/karmada-io/karmada/test/e2e/framework.WaitServiceDisappearOnClusters({0xc0008ac480, 0x3, 0x4}, {0xc000161578, 0x11}, {0xc000c87f83, 0xd})
/home/runner/work/karmada/karmada/test/e2e/framework/service.go:96 +0x20f
github.com/karmada-io/karmada/test/e2e.glob..func32.2.2.1()
/home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:134 +0x278
reflect.Value.call({0x3cc0000?, 0xc0007066f0?, 0xc000731558?}, {0x418d421, 0x4}, {0x69a5fa0, 0x0, 0xc000854000?})
/opt/hostedtoolcache/go/1.21.10/x64/src/reflect/value.go:596 +0x14ce
reflect.Value.Call({0x3cc0000?, 0xc0007066f0?, 0x13?}, {0x69a5fa0, 0x0, 0x0})
/opt/hostedtoolcache/go/1.21.10/x64/src/reflect/value.go:380 +0xb6
------------------------------
[SynchronizedAfterSuite] PASSED [0.048 seconds]
[SynchronizedAfterSuite]
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:170
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[SynchronizedAfterSuite] PASSED [0.110 seconds]
[SynchronizedAfterSuite]
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:170
------------------------------
Summarizing 4 Failures:
[INTERRUPTED] [OverridePolicy] apply overriders testing [LabelsOverrider] apply labels overrider testing [It] deployment labelsOverride testing
/home/runner/work/karmada/karmada/test/e2e/overridepolicy_test.go:110
[INTERRUPTED] [CronFederatedHPA] CronFederatedHPA testing Test suspend rule in CronFederatedHPA [It] Test suspend rule with CronFederatedHPA
/home/runner/work/karmada/karmada/test/e2e/cronfederatedhpa_test.go:149
[FAIL] [BasicPropagation] propagation testing Service propagation testing [DeferCleanup (Each)] service propagation testing
/home/runner/work/karmada/karmada/test/e2e/framework/service.go:91
[FAIL] [BasicClusterPropagation] propagation testing ClusterRole propagation testing [DeferCleanup (Each)] clusterRole propagation testing
/home/runner/work/karmada/karmada/test/e2e/framework/rbac.go:174
Ran 95 of 189 Specs in 571.853 seconds
FAIL! - Interrupted by Other Ginkgo Process -- 91 Passed | 4 Failed | 0 Pending | 94 Skipped
**CI Workflow / e2e test (v1.28.0) (pull_request)** Successful in 40m
Hi, I found that v1.28.0 passed. I don't know why this e2e log output shows up. Can you dump the error from karmada-controller-manager, or tell me where to get it?
Hi @caden2016, are you still in this PR?