Karmada controller reconcile performance optimization
What would you like to be added: https://github.com/karmada-io/karmada/blob/b3d3bcde6622654a1ed048fbce29e3b2d2deaaf7/pkg/detector/detector.go#L356-L387
- To find the matching PropagationPolicy, all PropagationPolicies are listed and a for loop checks each one in turn. This loop may take too long: in our environment there are about 7000 PropagationPolicies and Deployments.
- In the function ConvertToTypedObject, runtime.DefaultUnstructuredConverter.FromUnstructured takes a long time to perform the type conversion; can this function call be removed?
Is there a better way to optimize the above problems?
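One direction for the first problem is to maintain an index over the policies instead of scanning the whole list on every resource event. Karmada's real matching logic involves resource selectors, label selectors, and priorities, so the sketch below is only illustrative: it uses hypothetical simplified types and shows a map-based index replacing the linear scan for an exact-match case.

```go
package main

import "fmt"

// Policy is a hypothetical, simplified stand-in for a PropagationPolicy
// that selects a resource by exact (namespace, name).
type Policy struct {
	Name            string
	TargetNamespace string
	TargetName      string
}

// PolicyIndex maps "namespace/name" of the selected resource to the
// policies targeting it, so a lookup is O(1) instead of a scan over
// all ~7000 policies on every reconcile.
type PolicyIndex struct {
	byTarget map[string][]*Policy
}

func NewPolicyIndex(policies []*Policy) *PolicyIndex {
	idx := &PolicyIndex{byTarget: make(map[string][]*Policy)}
	for _, p := range policies {
		key := p.TargetNamespace + "/" + p.TargetName
		idx.byTarget[key] = append(idx.byTarget[key], p)
	}
	return idx
}

// Lookup returns the policies matching the given resource without
// iterating over the full policy list.
func (i *PolicyIndex) Lookup(namespace, name string) []*Policy {
	return i.byTarget[namespace+"/"+name]
}

func main() {
	policies := []*Policy{
		{Name: "pp-1", TargetNamespace: "default", TargetName: "nginx"},
		{Name: "pp-2", TargetNamespace: "default", TargetName: "redis"},
	}
	idx := NewPolicyIndex(policies)
	for _, p := range idx.Lookup("default", "nginx") {
		fmt.Println(p.Name)
	}
}
```

In practice the index would have to be kept in sync from the policy informer's add/update/delete events, and label-selector-based policies would still need a fallback path; the point is only that the per-event cost no longer grows linearly with the total number of policies.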
Why is this needed:
As shown below, the resource_match_policy_duration_seconds_bucket metric indicates that a single execution took more than 0.5 seconds, or even 0.9 seconds. This may make the resource detector controller's reconcile take too long and cause the workqueue to build up a long backlog.
By looking at the pprof CPU profile, we found that the function ConvertToTypedObject takes a long time to execute.
After removing the outermost for loop and the call to ConvertToTypedObject, testing shows that the performance of the resource detector controller improves significantly.
We used a machine with a 4-core CPU as the master node and created 1000 Deployments and PropagationPolicies:
- Before: 5 minutes 45 seconds
- After: 2 minutes 30 seconds
/assign @CharlesQQ
In favor of #5802
After adjusting the parameter --concurrent-resourcebinding-syncs from 5 to 30, the effect on the resourcebinding controller's queue backlog is not obvious; it still takes almost 5 minutes.
The problem occurs in the ensureWork function. The following is my per-function timing log:
- ApplyOverridePolicies: 1.025481204 + 1.181936987 = 2.207418191 s, accounting for 79% of the cost
- Client.List policies: 0.980144234 + 0.811084427 = 1.791228661 s, accounting for 64% of the cost
- CreateOrUpdateWork: 0.426976309 + 0.048349991 = 0.475326300 s, accounting for 17% of the cost
```
I1113 14:36:11.161745 18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681, cluster member-cluster1, cost: 0.980144234
I1113 14:36:11.198787 18 overridemanager.go:217] getOverridersFromOverridePolicies cluster: member-cluster1 name: configmap-beta-moa-demo-prod-681 cost: 300.506µs
I1113 14:36:11.206905 18 overridemanager.go:98] ApplyOverridePolicies configmap-beta-moa-demo-prod-681, cluster member-cluster1, cost: 1.025481204
I1113 14:36:11.392423 18 work.go:101] RetryOnConflict CreateOrUpdate workload configmap-beta-moa-demo-prod-681, namespace karmada-es-member-cluster1, cost 0.18521976
I1113 14:36:11.633971 18 common.go:151] CreateOrUpdateWork configmap-beta-moa-demo-prod-681, cost: 0.426976309
I1113 14:36:12.445315 18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681, cluster member-cluster2, cost: 0.811084427
I1113 14:36:12.532025 18 overridemanager.go:217] getOverridersFromOverridePolicies cluster: member-cluster2 name: configmap-beta-moa-demo-prod-681 cost: 4.289999ms
I1113 14:36:12.815987 18 overridemanager.go:98] ApplyOverridePolicies configmap-beta-moa-demo-prod-681, cluster member-cluster2, cost: 1.181936987
I1113 14:36:12.864306 18 work.go:101] RetryOnConflict CreateOrUpdate workload configmap-beta-moa-demo-prod-681, namespace karmada-es-member-cluster2, cost 0.048121342
I1113 14:36:12.864415 18 common.go:151] CreateOrUpdateWork configmap-beta-moa-demo-prod-681, cost: 0.048349991
I1113 14:36:12.864438 18 common.go:153] End for range configmap-beta-moa-demo-prod-681-configmap, cost: 2.683047699
I1113 14:36:12.864466 18 binding_controller.go:133] Ensure work configmap-beta-moa-demo-prod-681-configmap, cost: 2.6830823759999998
I1113 14:36:12.864529 18 binding_controller.go:74] ResourceBinding reconcile default/configmap-beta-moa-demo-prod-681-configmap cost: 2.790846222
```
It seems that deepcopy costs the most time. After turning off deepcopy on the List call, the list time drops to 0.1 s or even lower:
```
I1113 15:43:39.212238 17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283, cluster member-cluster1, cost: 0.040210384
I1113 15:43:39.251189 17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-97, cluster member-cluster2, cost: 0.018000362
I1113 15:43:39.314731 17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283, cluster member-cluster2, cost: 0.023638913
```
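The deepcopy exists because the cached client copies every object returned by List to protect the shared informer cache from mutation; controller-runtime lets read-only callers skip it via the client.UsafeDisableDeepCopy-style list option (client.UnsafeDisableDeepCopy), at the price of having to treat the returned objects as read-only. Since demonstrating that needs the full controller-runtime stack, the sketch below models the same trade-off in plain Go with hypothetical types: a cache that returns copies by default and shared objects when copying is disabled.

```go
package main

import "fmt"

// Item is a stand-in for a cached API object.
type Item struct {
	Name   string
	Labels map[string]string
}

func (i *Item) DeepCopy() *Item {
	labels := make(map[string]string, len(i.Labels))
	for k, v := range i.Labels {
		labels[k] = v
	}
	return &Item{Name: i.Name, Labels: labels}
}

// Cache models an informer cache shared by many controllers.
type Cache struct {
	items []*Item
}

// List returns the cached items. With disableDeepCopy (analogous to
// controller-runtime's UnsafeDisableDeepCopy list option) the cached
// objects themselves are returned: no per-item allocation or copying,
// but callers must treat them as strictly read-only.
func (c *Cache) List(disableDeepCopy bool) []*Item {
	if disableDeepCopy {
		return c.items
	}
	out := make([]*Item, len(c.items))
	for idx, it := range c.items {
		out[idx] = it.DeepCopy()
	}
	return out
}

func main() {
	cache := &Cache{items: []*Item{{Name: "op-1", Labels: map[string]string{"a": "b"}}}}

	copied := cache.List(false)
	copied[0].Labels["a"] = "mutated" // safe: does not touch the cache

	shared := cache.List(true)
	fmt.Println(shared[0].Labels["a"]) // prints "b": the copy protected the cache
}
```

Skipping the copy is only safe if every caller on the path (here, the override manager's policy listing) is audited to never mutate the listed objects.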
For the resource detector controller, adjusting the parameter --concurrent-resource-template-syncs to 60 reduces the queue backlog from 16 minutes to less than 1 minute.
By setting the parameters --kube-api-qps=200 --kube-api-burst=300, the binding controller's queue backlog is reduced to 1 minute and 15 seconds. However, the client-side rate limiting cannot be completely eliminated.
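Putting the tunings from the experiments above together, the karmada-controller-manager flags would look roughly like this (these are the values used in the tests above, not recommended defaults; appropriate values depend on cluster size and apiserver capacity):

```shell
# karmada-controller-manager flag tuning from the experiments above
--concurrent-resource-template-syncs=60 \
--concurrent-resourcebinding-syncs=30 \
--kube-api-qps=200 \
--kube-api-burst=300
```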
@CharlesQQ Could you please help to confirm if all tasks planned have been done?
/close
Please reopen it if anything is left.
@RainbowMango: Closing this issue.