Performance optimization for karmada controller reconcile

Open CharlesQQ opened this issue 1 year ago • 5 comments

What would you like to be added: https://github.com/karmada-io/karmada/blob/b3d3bcde6622654a1ed048fbce29e3b2d2deaaf7/pkg/detector/detector.go#L356-L387

  • To find the matching PropagationPolicy, all PropagationPolicies are listed and a for loop checks them one by one. This loop may take too long: the number of pp and Deployment resources is about 7000.
  • In the function ConvertToTypedObject, the calls to runtime.DefaultUnstructuredConverter.FromUnstructured take a long time to perform type conversion; can this function call be removed?

Is there a better way to optimize the above problems?
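
To make the cost pattern in the two bullets concrete, here is a minimal sketch using only the standard library. It is not the code in detector.go: the `policy` type, `toTyped`, `typedCache`, and the JSON round trip that stands in for the type conversion are all hypothetical. It counts how many conversions happen when every reconcile converts every listed policy, and compares that with memoizing the typed result per policy name and resourceVersion, which is one possible direction and not necessarily what the eventual fix does.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// policy is a hypothetical, trimmed-down typed PropagationPolicy.
type policy struct {
	Name            string `json:"name"`
	ResourceVersion string `json:"resourceVersion"`
	TargetKind      string `json:"targetKind"`
	TargetName      string `json:"targetName"`
}

// conversions counts how often the expensive conversion runs.
var conversions int

// toTyped stands in for an untyped-to-typed conversion (here a JSON round trip).
func toTyped(raw map[string]interface{}) (policy, error) {
	conversions++
	var p policy
	b, err := json.Marshal(raw)
	if err != nil {
		return p, err
	}
	err = json.Unmarshal(b, &p)
	return p, err
}

// typedCache memoizes conversions per policy name and resourceVersion.
// It is not safe for concurrent use; a real controller would need locking.
type typedCache map[string]policy

func (c typedCache) get(raw map[string]interface{}) (policy, error) {
	key := fmt.Sprintf("%v@%v", raw["name"], raw["resourceVersion"])
	if p, ok := c[key]; ok {
		return p, nil
	}
	p, err := toTyped(raw)
	if err == nil {
		c[key] = p
	}
	return p, err
}

type converter func(map[string]interface{}) (policy, error)

// countMatches scans every policy for one resource, as the issue describes.
func countMatches(raws []map[string]interface{}, kind, name string, conv converter) int {
	matched := 0
	for _, raw := range raws {
		p, err := conv(raw)
		if err != nil {
			continue
		}
		if p.TargetKind == kind && p.TargetName == name {
			matched++
		}
	}
	return matched
}

// buildPolicies creates n untyped policies, each selecting Deployment app-<i>.
func buildPolicies(n int) []map[string]interface{} {
	raws := make([]map[string]interface{}, 0, n)
	for i := 0; i < n; i++ {
		raws = append(raws, map[string]interface{}{
			"name":            fmt.Sprintf("pp-%d", i),
			"resourceVersion": "1",
			"targetKind":      "Deployment",
			"targetName":      fmt.Sprintf("app-%d", i),
		})
	}
	return raws
}

func main() {
	const n = 100 // small stand-in for the ~7000 objects in the issue
	raws := buildPolicies(n)

	conversions = 0
	for i := 0; i < n; i++ {
		countMatches(raws, "Deployment", fmt.Sprintf("app-%d", i), toTyped)
	}
	fmt.Println("conversions without cache:", conversions) // 10000

	conversions = 0
	cache := typedCache{}
	for i := 0; i < n; i++ {
		countMatches(raws, "Deployment", fmt.Sprintf("app-%d", i), cache.get)
	}
	fmt.Println("conversions with cache:", conversions) // 100
}
```

Without the cache the number of conversions grows with resources × policies, which is why the per-reconcile cost becomes visible at several thousand objects.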

Why is this needed: As shown below, the resource_match_policy_duration_seconds_bucket metric indicates that matching took more than 0.5 seconds, and sometimes more than 0.9 seconds. This can make each reconcile of the resource detector controller too slow and cause a long backlog in the workqueue. image image

By looking at the pprof cpu profile, we found that the function ConvertToTypedObject takes a long time to execute. image

CharlesQQ avatar Nov 06 '24 11:11 CharlesQQ

After removing the outermost for loop and the call to ConvertToTypedObject, testing shows that the resource detector controller performs significantly better.

We used a 4-core machine as the master node and created 1000 Deployments and pp:

Before: cost 5 minutes and 45 seconds

After: cost 2 minutes and 30 seconds image

image

CharlesQQ avatar Nov 11 '24 06:11 CharlesQQ

/assign @CharlesQQ In favor of #5802

RainbowMango avatar Nov 11 '24 10:11 RainbowMango

Raising --concurrent-resourcebinding-syncs from 5 to 30 has no obvious effect on the resourcebinding controller's queue backlog; draining it still takes almost 5 minutes.

The problem occurs in the ensureWork function. The following is my timing log for each function:

  • ApplyOverridePolicies: 1.025481204 + 1.181936987 = 2.207418191, about 79% of the total
    • Client.List Policies: 0.980144234 + 0.811084427 = 1.791228661, about 64% of the total
  • CreateOrUpdateWork: 0.426976309 + 0.048349991 = 0.4753263, about 17% of the total

I1113 14:36:11.161745      18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681,  cluster member-cluster1, cost: 0.980144234
I1113 14:36:11.198787      18 overridemanager.go:217] getOverridersFromOverridePolicies  cluster: member-cluster1  name: configmap-beta-moa-demo-prod-681    cost: 300.506µs
I1113 14:36:11.206905      18 overridemanager.go:98] ApplyOverridePolicies configmap-beta-moa-demo-prod-681, cluster member-cluster1, cost: 1.025481204
I1113 14:36:11.392423      18 work.go:101] RetryOnConflict CreateOrUpdate workload configmap-beta-moa-demo-prod-681, namespace karmada-es-member-cluster1, cost 0.18521976
I1113 14:36:11.633971      18 common.go:151] CreateOrUpdateWork configmap-beta-moa-demo-prod-681,   cost: 0.426976309
I1113 14:36:12.445315      18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681,  cluster member-cluster2, cost: 0.811084427
I1113 14:36:12.532025      18 overridemanager.go:217] getOverridersFromOverridePolicies  cluster: member-cluster2  name: configmap-beta-moa-demo-prod-681    cost: 4.289999ms
I1113 14:36:12.815987      18 overridemanager.go:98] ApplyOverridePolicies configmap-beta-moa-demo-prod-681, cluster member-cluster2, cost: 1.181936987
I1113 14:36:12.864306      18 work.go:101] RetryOnConflict CreateOrUpdate workload configmap-beta-moa-demo-prod-681, namespace karmada-es-member-cluster2, cost 0.048121342
I1113 14:36:12.864415      18 common.go:151] CreateOrUpdateWork configmap-beta-moa-demo-prod-681,   cost: 0.048349991
I1113 14:36:12.864438      18 common.go:153] End for range configmap-beta-moa-demo-prod-681-configmap, cost: 2.683047699
I1113 14:36:12.864466      18 binding_controller.go:133] Ensure work configmap-beta-moa-demo-prod-681-configmap, cost: 2.6830823759999998
I1113 14:36:12.864529      18 binding_controller.go:74] ResourceBinding reconcile  default/configmap-beta-moa-demo-prod-681-configmap cost: 2.790846222
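
The percentages in the breakdown can be checked against the log. The sketch below assumes they are relative to the 2.790846222 s ResourceBinding reconcile duration from the last log line, since that is the total that reproduces the quoted figures; `share` is a helper introduced only for this illustration.

```go
package main

import "fmt"

// share returns part/total as a percentage.
func share(part, total float64) float64 {
	return part / total * 100
}

func main() {
	// Durations in seconds, copied from the log lines above.
	const reconcile = 2.790846222 // ResourceBinding reconcile

	applyOverride := 1.025481204 + 1.181936987  // ApplyOverridePolicies, both clusters
	listPolicies := 0.980144234 + 0.811084427   // applyNamespacedOverrides list, both clusters
	createOrUpdate := 0.426976309 + 0.048349991 // CreateOrUpdateWork, both clusters

	fmt.Printf("ApplyOverridePolicies: %.3fs (%.0f%%)\n", applyOverride, share(applyOverride, reconcile))
	fmt.Printf("  list policies:       %.3fs (%.0f%%)\n", listPolicies, share(listPolicies, reconcile))
	fmt.Printf("CreateOrUpdateWork:    %.3fs (%.0f%%)\n", createOrUpdate, share(createOrUpdate, reconcile))
	// Prints 2.207s (79%), 1.791s (64%) and 0.475s (17%).
}
```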

It seems that deepcopy costs the most time: image

image image image

After turning off list deepcopy, the list time is reduced to 0.1s or even lower.

I1113 15:43:39.212238      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283,  cluster member-cluster1, cost: 0.040210384
I1113 15:43:39.251189      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-97,  cluster member-cluster2, cost: 0.018000362
I1113 15:43:39.314731      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283,  cluster member-cluster2, cost: 0.023638913
image
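
The trade-off behind turning off deep copies on list can be sketched with the standard library alone. The `store` and `overridePolicy` types below are hypothetical stand-ins for an informer-backed cache and its objects. The thread does not say which mechanism was used; controller-runtime, for example, offers an `UnsafeDisableDeepCopy` list option for this purpose. The sketch shows why skipping the copy is cheaper and why callers must then treat the results as read-only.

```go
package main

import "fmt"

// overridePolicy is a hypothetical stand-in for a cached API object.
type overridePolicy struct {
	Name   string
	Labels map[string]string
}

// deepCopy returns an independent copy, including the Labels map.
func (p *overridePolicy) deepCopy() *overridePolicy {
	out := &overridePolicy{Name: p.Name, Labels: make(map[string]string, len(p.Labels))}
	for k, v := range p.Labels {
		out.Labels[k] = v
	}
	return out
}

// store plays the role of an informer-backed cache.
type store struct{ items []*overridePolicy }

// list returns the cached items. With copyItems=false the caller receives
// the cache's own pointers and must treat them as read-only.
func (s *store) list(copyItems bool) []*overridePolicy {
	out := make([]*overridePolicy, 0, len(s.items))
	for _, it := range s.items {
		if copyItems {
			it = it.deepCopy()
		}
		out = append(out, it)
	}
	return out
}

func main() {
	s := &store{items: []*overridePolicy{
		{Name: "op-1", Labels: map[string]string{"env": "prod"}},
	}}

	safe := s.list(true)
	safe[0].Labels["env"] = "changed"     // only touches the copy
	fmt.Println(s.items[0].Labels["env"]) // prod

	shared := s.list(false)
	shared[0].Labels["env"] = "changed"   // mutates the cached object itself
	fmt.Println(s.items[0].Labels["env"]) // changed
}
```

With several thousand policies, the copying path allocates a full copy of every object on each list call, while the shared path allocates only the result slice. That saving is safe only if no code path mutates the listed objects.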

CharlesQQ avatar Nov 12 '24 10:11 CharlesQQ

For the resource detector controller, setting --concurrent-resource-template-syncs=60 reduces the queue backlog from 16 minutes to less than 1 minute.

CharlesQQ avatar Nov 13 '24 06:11 CharlesQQ

With --kube-api-qps=200 --kube-api-burst=300, the binding controller's queue backlog is reduced to 1 minute and 15 seconds. However, the client-side rate limit cannot be eliminated completely.

image

CharlesQQ avatar Nov 22 '24 12:11 CharlesQQ

@CharlesQQ Could you please help to confirm if all tasks planned have been done?

/close Please reopen it if anything is left.

RainbowMango avatar Oct 15 '25 08:10 RainbowMango

@RainbowMango: Closing this issue.

karmada-bot avatar Oct 15 '25 08:10 karmada-bot