
fix WorkStatusController: reconcile work status when member cluster becomes ready again

Open · caden2016 opened this pull request 1 year ago • 17 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

When I use Karmada to manage a member cluster (member1), Karmada can lose access to member1 for some time due to network issues. When the network is restored, member1 becomes Ready again on Karmada, but the status of resources in the member cluster is not synchronized back to Karmada in time.

This PR tries to fix that by making the WorkStatusController aware of cluster status.
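
For illustration only, here is a minimal sketch of the idea (not this PR's actual diff): watch Cluster objects from the WorkStatusController and requeue all of a cluster's Work objects when the cluster reports Ready again. The helper workNamespaceFor and the package layout are assumptions; Karmada keeps each cluster's Work objects in a per-cluster execution namespace named "karmada-es-<cluster>".

package workstatus

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
	workv1alpha1 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha1"
)

// workNamespaceFor is a hypothetical helper: Karmada stores each cluster's
// Work objects in an execution namespace named "karmada-es-<cluster>".
func workNamespaceFor(cluster string) string {
	return "karmada-es-" + cluster
}

// mapClusterToWorks turns a Cluster event into reconcile requests for every
// Work in that cluster's execution namespace, so a NotReady->Ready transition
// re-drives status collection for all of the cluster's workloads. Wire it up
// with handler.EnqueueRequestsFromMapFunc(mapClusterToWorks(mgr.GetClient())).
func mapClusterToWorks(c client.Client) handler.MapFunc {
	return func(ctx context.Context, obj client.Object) []reconcile.Request {
		cluster, ok := obj.(*clusterv1alpha1.Cluster)
		if !ok {
			return nil
		}
		// Only fan out once the cluster is Ready; NotReady clusters are
		// unreachable anyway, so requeuing their Works would be wasted effort.
		if !meta.IsStatusConditionTrue(cluster.Status.Conditions, clusterv1alpha1.ClusterConditionReady) {
			return nil
		}
		workList := &workv1alpha1.WorkList{}
		if err := c.List(ctx, workList, client.InNamespace(workNamespaceFor(cluster.Name))); err != nil {
			return nil
		}
		requests := make([]reconcile.Request, 0, len(workList.Items))
		for i := range workList.Items {
			requests = append(requests, reconcile.Request{NamespacedName: types.NamespacedName{
				Namespace: workList.Items[i].Namespace,
				Name:      workList.Items[i].Name,
			}})
		}
		return requests
	}
}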

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


caden2016 commented May 28 '24 07:05

Welcome @caden2016! It looks like this is your first PR to karmada-io/karmada 🎉

karmada-bot commented May 28 '24 07:05

:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 12.90323% with 27 lines in your changes missing coverage. Please review.

Project coverage is 53.23%. Comparing base (37dd12f) to head (e2077b4). Report is 1021 commits behind head on master.

Files with missing lines                                   Patch %   Lines
pkg/controllers/status/work_status_controller.go             0.00%   24 Missing :warning:
...kg/controllers/status/cluster_status_controller.go       57.14%   2 Missing and 1 partial :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4991      +/-   ##
==========================================
- Coverage   53.32%   53.23%   -0.09%     
==========================================
  Files         253      253              
  Lines       20560    20590      +30     
==========================================
- Hits        10963    10962       -1     
- Misses       8871     8901      +30     
- Partials      726      727       +1     
Flag        Coverage Δ
unittests   53.23% <12.90%> (-0.09%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter commented May 28 '24 08:05

You need to sign the DCO.

https://github.com/karmada-io/karmada/pull/4991/checks?check_run_id=25487819915

chaunceyjiang commented May 28 '24 09:05

I'm not sure if the failing cases are related, just give it another try. /retest

RainbowMango commented May 28 '24 16:05

I'm not sure if the failing cases are related, just give it another try. /retest

I don't know why the e2e tests failed. Can you help me check whether the failures are related to the code I changed?

caden2016 commented May 29 '24 03:05

Hi @caden2016

but the status of resources in the member cluster is not synchronized back to Karmada in time.

How did you test this? Because it seems to me that the cluster informer is not currently deleted after the cluster becomes NotReady, and the event handler registered on the cluster informer will still work when the cluster network is restored.

XiShanYongYe-Chang commented May 29 '24 03:05

Hi @caden2016

but the status of resources in the member cluster is not synchronized back to Karmada in time.

How did you test this? Because it seems to me that the cluster informer is not currently deleted after the cluster becomes NotReady, and the event handler registered on the cluster informer will still work when the cluster network is restored.

I did not test with the newest version. I made the network between Karmada and the member cluster unreachable, changed a resource's status in cluster member1, and found that the resource status in Karmada remained unchanged. But when the network was restored, the status was reconciled to Karmada in time. My change makes the WorkStatusController aware of the cluster status: when a cluster becomes Ready again, it reconciles all of the cluster's Work objects and restarts the informers.

I found the version I tested has code like the following. Does it matter?

func (c *ClusterStatusController) syncClusterStatus(cluster *clusterv1alpha1.Cluster) (controllerruntime.Result, error) {
...
	if !online && readyCondition.Status != metav1.ConditionTrue {
		klog.V(2).Infof("Cluster(%s) still offline after %s, ensuring offline is set.",
			cluster.Name, c.ClusterFailureThreshold.Duration)
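		// Stopping the informer here avoids endless reconnect attempts while
		// the cluster is offline, but note that nothing restarts it once the
		// cluster recovers; that is why Work status stops syncing.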
		c.InformerManager.Stop(cluster.Name)
		setTransitionTime(cluster.Status.Conditions, readyCondition)
		meta.SetStatusCondition(&currentClusterStatus.Conditions, *readyCondition)
		return c.updateStatusIfNeeded(cluster, currentClusterStatus)
	}
...
}

caden2016 commented May 29 '24 07:05
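
For illustration, here is a condensed sketch of the behavior caden2016 describes, under stated assumptions: WorkStatusController's shape and the helpers ensureInformersStarted and syncWorkStatus are stand-in names, not Karmada's actual implementation. The idea is to back off while the owning cluster is NotReady and restart the stopped informers once it turns Ready again.

package workstatus

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
	workv1alpha1 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha1"
	"github.com/karmada-io/karmada/pkg/util"
	"github.com/karmada-io/karmada/pkg/util/names"
)

// WorkStatusController is a simplified stand-in for the real controller.
type WorkStatusController struct {
	client.Client
}

// syncWork sketches the recovery path: while the target cluster is offline we
// back off instead of hammering an unreachable API server; once it is Ready
// again we rebuild the informers that were stopped while it was offline.
func (c *WorkStatusController) syncWork(ctx context.Context, work *workv1alpha1.Work) (ctrl.Result, error) {
	// The Work's namespace is the execution space, e.g. karmada-es-member1.
	clusterName, err := names.GetClusterName(work.Namespace)
	if err != nil {
		return ctrl.Result{}, err
	}
	cluster := &clusterv1alpha1.Cluster{}
	if err := c.Get(ctx, client.ObjectKey{Name: clusterName}, cluster); err != nil {
		return ctrl.Result{}, err
	}
	if !util.IsClusterReady(&cluster.Status) {
		// Informers were stopped when the cluster went offline; wait for the
		// Ready condition instead of retrying against a dead endpoint.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	// Hypothetical helpers: restart the member-cluster informers, then sync
	// the collected resource status back into the Work object.
	if err := c.ensureInformersStarted(cluster); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, c.syncWorkStatus(ctx, work)
}

func (c *WorkStatusController) ensureInformersStarted(cluster *clusterv1alpha1.Cluster) error {
	// Stand-in: (re)build and start the dynamic informers for this cluster.
	return nil
}

func (c *WorkStatusController) syncWorkStatus(ctx context.Context, work *workv1alpha1.Work) error {
	// Stand-in: copy collected member-cluster resource status into the Work.
	return nil
}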

But when the network was restored, the status was reconciled to Karmada in time.

Is this what happens after you modify it?

My change makes the WorkStatusController aware of the cluster status: when a cluster becomes Ready again, it reconciles all of the cluster's Work objects and restarts the informers.

There may be a lot of Work objects, which could affect the performance of the controller.

XiShanYongYe-Chang commented May 29 '24 08:05
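
One way to bound the fan-out raised in the previous comment (again a sketch under assumptions, not this PR's code): filter the Cluster watch with a predicate so that only a genuine NotReady-to-Ready transition triggers the Work requeue, rather than every Cluster status update.

package workstatus

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
)

// clusterBecameReady passes only update events where the Ready condition
// flipped from false (or unset) to true, filtering out the frequent
// status-only updates every healthy cluster generates.
func clusterBecameReady() predicate.Funcs {
	isReady := func(c *clusterv1alpha1.Cluster) bool {
		return meta.IsStatusConditionTrue(c.Status.Conditions, clusterv1alpha1.ClusterConditionReady)
	}
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldCluster, okOld := e.ObjectOld.(*clusterv1alpha1.Cluster)
			newCluster, okNew := e.ObjectNew.(*clusterv1alpha1.Cluster)
			if !okOld || !okNew {
				return false
			}
			return !isReady(oldCluster) && isReady(newCluster)
		},
		CreateFunc:  func(event.CreateEvent) bool { return false },
		DeleteFunc:  func(event.DeleteEvent) bool { return false },
		GenericFunc: func(event.GenericEvent) bool { return false },
	}
}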

Is this what happens after you modify it?

yes

There may be a lot of Work objects, which could affect the performance of the controller.

Maybe a lot of resource informers will be started, but I think it is suitable for the WorkStatusController to do this.

caden2016 commented May 29 '24 09:05

How did you test this? Because it seems to me that the cluster informer is not currently deleted after the cluster becomes NotReady, and the event handler registered on the cluster informer will still work when the cluster network is restored.

I think the informer must be deleted, otherwise there would be lots of informer reconnect attempts, as in 25df726, so my change works to restore it. As for the e2e errors, are some of them caused by waiting for resources to disappear on member clusters?

caden2016 commented May 30 '24 02:05

Is this what happens after you modify it?

yes

What was the behavior before the change?

XiShanYongYe-Chang commented May 30 '24 03:05

Is this what happens after you modify it?

yes

What was the behavior before the change?

I don't know about the newest code; I tried with the old version v1.2.0 and it did not work.
I think that even if master adds back 25df726, it still will not work, since nothing restarts the informer. The reason to add back stopping informers in the ClusterStatusController is that, otherwise, I think there will be lots of informer reconnect attempts while the network is unreachable.

caden2016 commented May 30 '24 05:05

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign garrybest, lonelycz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

karmada-bot commented Jun 04 '24 14:06

/retest

caden2016 commented Jun 06 '24 08:06

@caden2016: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

karmada-bot commented Jun 06 '24 08:06

/ok-to-test

/retest

------------------------------
[SynchronizedAfterSuite] PASSED [0.004 seconds]
[SynchronizedAfterSuite] 
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:170
------------------------------
• [FAILED] [435.094 seconds]
[BasicPropagation] propagation testing Service propagation testing [DeferCleanup (Each)] service propagation testing
  [DeferCleanup (Each)] /home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:131
  [It] /home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:138

  Captured StdOut/StdErr Output >>
  I0604 15:30:04.853436   47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member1)
  I0604 15:30:09.860442   47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member2)
  I0604 15:30:09.863970   47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member3)
  I0604 15:30:09.871770   47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member1)
  I0604 15:30:14.878639   47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member2)
  I0604 15:30:14.881384   47187 service.go:55] Waiting for service(karmadatest-97mbg/service-rgnqm) synced on cluster(member3)
  I0604 15:30:14.899192   47187 service.go:79] Waiting for service(karmadatest-97mbg/service-rgnqm) disappear on cluster(member1)
  I0604 15:30:19.904891   47187 service.go:79] Waiting for service(karmadatest-97mbg/service-rgnqm) disappear on cluster(member2)
  I0604 15:30:19.906885   47187 service.go:79] Waiting for service(karmadatest-97mbg/service-rgnqm) disappear on cluster(member3)
  << Captured StdOut/StdErr Output

  Timeline >>
  STEP: Creating PropagationPolicy(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:04.815
  STEP: Creating Service(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:04.831
  STEP: Waiting for service(karmadatest-97mbg/service-rgnqm) synced on member clusters @ 06/04/24 15:30:04.853
  STEP: Updating service(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:09.867
  STEP: Waiting for service(karmadatest-97mbg/service-rgnqm) synced on member clusters @ 06/04/24 15:30:09.871
  STEP: Removing PropagationPolicy(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:14.884
  STEP: Removing Service(karmadatest-97mbg/service-rgnqm) @ 06/04/24 15:30:14.889
  STEP: Check if service(karmadatest-97mbg/service-rgnqm) diappeare on member clusters @ 06/04/24 15:30:14.899
  [FAILED] in [DeferCleanup (Each)] - /home/runner/work/karmada/karmada/test/e2e/framework/service.go:91 @ 06/04/24 15:37:19.908
  << Timeline

  [FAILED] Timed out after 420.001s.
  Expected
      <bool>: false
  to equal
      <bool>: true
  In [DeferCleanup (Each)] at: /home/runner/work/karmada/karmada/test/e2e/framework/service.go:91 @ 06/04/24 15:37:19.908

  Full Stack Trace
    github.com/karmada-io/karmada/test/e2e/framework.WaitServiceDisappearOnCluster({0xc000359ae9, 0x7}, {0xc000161578, 0x11}, {0xc000c87f83, 0xd})
    	/home/runner/work/karmada/karmada/test/e2e/framework/service.go:91 +0x614
    github.com/karmada-io/karmada/test/e2e/framework.WaitServiceDisappearOnClusters.func1()
    	/home/runner/work/karmada/karmada/test/e2e/framework/service.go:98 +0xaa
    github.com/karmada-io/karmada/test/e2e/framework.WaitServiceDisappearOnClusters({0xc0008ac480, 0x3, 0x4}, {0xc000161578, 0x11}, {0xc000c87f83, 0xd})
    	/home/runner/work/karmada/karmada/test/e2e/framework/service.go:96 +0x20f
    github.com/karmada-io/karmada/test/e2e.glob..func32.2.2.1()
    	/home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:134 +0x278
    reflect.Value.call({0x3cc0000?, 0xc0007066f0?, 0xc000731558?}, {0x418d421, 0x4}, {0x69a5fa0, 0x0, 0xc000854000?})
    	/opt/hostedtoolcache/go/1.21.10/x64/src/reflect/value.go:596 +0x14ce
    reflect.Value.Call({0x3cc0000?, 0xc0007066f0?, 0x13?}, {0x69a5fa0, 0x0, 0x0})
    	/opt/hostedtoolcache/go/1.21.10/x64/src/reflect/value.go:380 +0xb6
------------------------------
[SynchronizedAfterSuite] PASSED [0.048 seconds]
[SynchronizedAfterSuite] 
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:170
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[SynchronizedAfterSuite] PASSED [0.110 seconds]
[SynchronizedAfterSuite] 
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:170
------------------------------

Summarizing 4 Failures:
  [INTERRUPTED] [OverridePolicy] apply overriders testing [LabelsOverrider] apply labels overrider testing [It] deployment labelsOverride testing
  /home/runner/work/karmada/karmada/test/e2e/overridepolicy_test.go:110
  [INTERRUPTED] [CronFederatedHPA] CronFederatedHPA testing Test suspend rule in CronFederatedHPA [It] Test suspend rule with CronFederatedHPA
  /home/runner/work/karmada/karmada/test/e2e/cronfederatedhpa_test.go:149
  [FAIL] [BasicPropagation] propagation testing Service propagation testing [DeferCleanup (Each)] service propagation testing
  /home/runner/work/karmada/karmada/test/e2e/framework/service.go:91
  [FAIL] [BasicClusterPropagation] propagation testing ClusterRole propagation testing [DeferCleanup (Each)] clusterRole propagation testing
  /home/runner/work/karmada/karmada/test/e2e/framework/rbac.go:174

Ran 95 of 189 Specs in 571.853 seconds
FAIL! - Interrupted by Other Ginkgo Process -- 91 Passed | 4 Failed | 0 Pending | 94 Skipped

liangyuanpeng commented Jun 06 '24 08:06

**CI Workflow / e2e test (v1.28.0) (pull_request)** Successful in 40m

Hi, I found the e2e test for v1.28.0 passed. I don't know why we got this e2e log output. Can you dump the error from karmada-controller-manager, or tell me where to get it?

caden2016 commented Jun 13 '24 05:06

Hi @caden2016, are you still working on this PR?

XiShanYongYe-Chang commented Jul 27 '24 09:07