machine-controller-manager
Reduce CPU utilization
Issue
During performance tests, the CPU utilisation of MCM was comparable to that of kube-apiserver and even higher than that of etcd. This is surprising, as MCM handles at least two orders of magnitude fewer resources than the other control-plane components.
Reducing the CPU utilisation will help improve the scalability of MCM as well as Gardener.
Solution
Profile and optimize the CPU utilization of MCM.
I think most of the CPU utilization is due to the reconciliation of machine objects. They get reconciled on every node object update, and a node object is updated every 10-30s because of condition checks such as memory, disk, network, and readiness. I think we could reduce this reconciliation frequency either in MCM or at the node level. We can decide which while fixing this issue.
I think we should also look at the number of goroutines and retry logic. But let's profile first.
Hi, we see constantly increasing CPU usage of MCM in our environment with only 3 nodes in the cluster. pprof shows a vast number of parked goroutines:
runtime.gopark
/usr/local/go/src/runtime/proc.go
Total: 29300 29300 (flat, cum) 100%
runtime.selectgo
/usr/local/go/src/runtime/select.go
Total: 2 29113 (flat, cum) 99.35%
But I must admit that I have no idea how this could happen. Any ideas?
@majst01 #341 addressed the constant increase in CPU usage. Does it not work for you?
We have kept the current issue open to track the optimisation of the baseline CPU usage.
We already have #341. I will try the most recent version as well and report back.
> We already have #341
Thanks. We saw improvement with #341. But this is good information for us. We will also check from our end.
I doubt that the upstream changes since #341 change any behavior here. I will also look for unclosed channels and similar issues.
> I doubt that the changes upstream since #341 change any behavior here.
Yes. If you already have #341, there are no further relevant changes that might help.
I tried to run https://github.com/golangci/golangci-lint on the code base, but it failed with:
WARN [runner] Can't run linter goanalysis_metalinter: assign: failed prerequisites: [email protected]/gardener/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion
WARN [runner] Can't run linter unused: buildssa: analysis skipped: errors in package: [/home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine_client.go:10:6: MachineInterface redeclared in this block /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:21:6: other declaration of MachineInterface /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:112:20: machine.Machine undefined (type *machine.Machine has no field or method Machine) /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:97:20: machine.Machine undefined (type *machine.Machine has no field or method Machine) /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:85:20: machine.Machine undefined (type *machine.Machine has no field or method Machine)]
We lint all of our code in CI to prevent obvious bugs, but this kind of problem never occurred.
We run the lint check here: https://github.com/gardener/machine-controller-manager/blob/master/.ci/check#L61 , in case it helps.
Also, what version of MCM were you rebasing/using?
We are using master