
Reduce crane-agent's CPU usage when a node has more than two hundred pods

Open wolfleave opened this issue 2 years ago • 4 comments

Describe the feature

Currently, when a node has more than two hundred pods, crane-agent consumes about 2 CPU cores. That is too high for an agent.

[screenshot: per-pod CPU usage] In this picture, crane-agent-bfcbj consumes 2C on a node with 217 pods, crane-agent-cjktf consumes 893m on a node with 119 pods, and crane-agent-zcv26 consumes 20m on a node with 13 pods.
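For reference, per-pod CPU figures like those above can be read with `kubectl top`; in this sketch the `crane-system` namespace and the `app=crane-agent` label are assumptions about the install, not confirmed by the issue:

```sh
# Read the agents' per-pod CPU usage. The namespace and label selector
# are assumptions (crane typically installs into crane-system); adjust
# them to match your cluster.
kubectl top pod -n crane-system -l app=crane-agent
```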

Expect

crane-agent should consume less than 500m when a node has 200 pods.

wolfleave avatar Feb 28 '23 09:02 wolfleave

Thanks for your issue. Is your agent version 0.9.0? We will start working on a fix next week.

chenkaiyue avatar Mar 10 '23 11:03 chenkaiyue

Yes, crane-agent 0.9.0. Looking forward to next week.

wolfleave avatar Mar 10 '23 11:03 wolfleave

> Yes, crane-agent 0.9.0. Looking forward to next week.

OK

chenkaiyue avatar Mar 10 '23 14:03 chenkaiyue

[screenshots: profiling results from the performance analysis described below]

Last week we profiled crane-agent and found that most of the CPU time is spent in the advisor, which collects pod-related metrics. The agent currently gathers a very large number of metrics, even though only the CPU- and memory-related ones are used for the watermark, and that collection work consumes a lot of CPU. Compared with 0.5.0, the current version collects many more metrics, which is why it uses noticeably more resources.
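For anyone who wants to reproduce this kind of analysis, a typical approach for a Go agent is a pprof CPU profile. This sketch assumes crane-agent exposes the standard `net/http/pprof` endpoints on port 8081; neither the endpoint nor the port is confirmed in this thread:

```sh
# Forward a profiling port from one agent pod (pod name taken from the
# screenshot above), then capture a 30-second CPU profile. The pprof
# endpoint and port 8081 are assumptions.
kubectl -n crane-system port-forward pod/crane-agent-bfcbj 8081:8081 &
go tool pprof -top 'http://127.0.0.1:8081/debug/pprof/profile?seconds=30'
```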

We also discussed how to solve this. Next, we will make the set of collected metrics a configurable item, so that metrics that nothing consumes downstream are not collected in the first place. This reduces metric collection while still meeting user needs; a sketch of the idea follows.
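To make the idea concrete, here is a minimal Go sketch of collect-only-what-is-enabled; every name in it is hypothetical, and none of it is the real crane-agent API:

```go
package main

import "fmt"

// Collector gathers one family of metrics (hypothetical interface).
type Collector interface {
	Name() string
	Collect() map[string]float64
}

type cpuCollector struct{}

func (cpuCollector) Name() string { return "cpu" }
func (cpuCollector) Collect() map[string]float64 {
	return map[string]float64{"cpu_usage_cores": 0.42} // placeholder value
}

type networkCollector struct{}

func (networkCollector) Name() string { return "network" }
func (networkCollector) Collect() map[string]float64 {
	return map[string]float64{"rx_bytes": 1 << 20} // placeholder value
}

// filterCollectors keeps only the collectors the user enabled, so metric
// families that nothing consumes are never gathered at all.
func filterCollectors(all []Collector, enabled map[string]bool) []Collector {
	var kept []Collector
	for _, c := range all {
		if enabled[c.Name()] {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	all := []Collector{cpuCollector{}, networkCollector{}}
	// A user who only needs the watermark would enable just the CPU and
	// memory families; here network stays off and is never collected.
	enabled := map[string]bool{"cpu": true}
	for _, c := range filterCollectors(all, enabled) {
		fmt.Println(c.Name(), c.Collect())
	}
}
```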

In the meantime, we recommend reducing the collection frequency: the metric collection interval defaults to 5 seconds and can be raised to 60 seconds with the --collect-interval parameter. Also review which features you actually use: if you only use the watermark function, you can disable the noderesource manager and podresource manager through the NodeResource and CranePodResource feature gates, which are both enabled by default.
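A minimal sketch of what that could look like in the crane-agent DaemonSet container spec; `--collect-interval` and the `NodeResource`/`CranePodResource` gate names come from the comment above, while the `--feature-gates` flag syntax and the duration format are assumptions based on common Kubernetes conventions:

```yaml
# Hypothetical excerpt of the crane-agent DaemonSet container args.
# Verify the exact flag names and value formats against your version.
spec:
  containers:
    - name: crane-agent
      args:
        - --collect-interval=60s   # default interval is 5s
        - --feature-gates=NodeResource=false,CranePodResource=false
```

Disabling the two gates only applies if you rely solely on the watermark function, since it turns off the noderesource and podresource managers.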

I tested on a node with 230 pods; after raising the interval to 60 seconds, crane-agent consumed approximately 0.5 C.

chenkaiyue avatar Mar 20 '23 08:03 chenkaiyue